The answer “Raster data is continuous data while vector data is discrete data.” is not complete: a raster of land use type represents a discrete (type) variable, while a polygon map with population density represents a continuous variable. The difference lies in spatially continuous variables like elevation or temperature, which are more easily represented by raster data, and spatially discrete features, such as houses and roads, which are more easily represented by vector data.
The values shown in figure 1.4 are population totals associated with their respective counties. Without the county boundaries the meaning disappears: raster pixels do not contain population totals per pixel, and population totals over larger regions or population densities can no longer be derived from this raster map alone.
(thanks to Jonas Hurst)
Convert the (x, y) points (10, 2), (-10, -1), (10, -2) and (0, 10) to polar coordinates:
cart2polar = function(x, y){
r = sqrt(x*x + y*y) # compute r (distance from origin)
phi = atan2(y, x) # compute phi (angle between point and positive x axis in rad)
phi_deg = phi * 180 / pi # compute angle in deg
result = c(r, phi_deg)
return(result)
}
cart2polar(10, 2)
## [1] 10.19804 11.30993
cart2polar(-10, -1)
## [1] 10.04988 -174.28941
cart2polar(10, -2)
## [1] 10.19804 -11.30993
cart2polar(0, 10)
## [1] 10 90
Convert the polar (r, phi) points (10, 45°), (0, 100°) and (5, 259°) to Cartesian coordinates:
deg2rad = function(angle_degree) {
angle_degree * pi / 180
}
polar2cart = function(r, phi_deg){
# phi must be in degrees
phi_rad = deg2rad(phi_deg) # convert phi in degrees to radians
x = r * cos(phi_rad)
y = r * sin(phi_rad)
c(x, y) # return value
}
polar2cart(10, 45)
## [1] 7.071068 7.071068
polar2cart(0, 100)
## [1] 0 0
polar2cart(5, 259)
## [1] -0.954045 -4.908136
Assuming the Earth is a sphere with a radius of 6371 km, compute for (lambda, phi) points the great circle distance between (10, 10) and (11, 10), between (10, 80) and (11, 80), between (10, 10) and (10, 11), and between (10, 80) and (10, 81).
distOnSphere = function(l1, phi1, l2, phi2, radius) {
l1_rad = deg2rad(l1)
l2_rad = deg2rad(l2)
phi1_rad = deg2rad(phi1)
phi2_rad = deg2rad(phi2)
theta = acos(
sin(phi1_rad) * sin(phi2_rad) +
cos(phi1_rad) * cos(phi2_rad) * cos(abs(l1_rad - l2_rad))
)
radius * theta # return value
}
radius = 6371
distOnSphere(10, 10, 11, 10, radius)
## [1] 109.5056
distOnSphere(10, 80, 11, 80, radius)
## [1] 19.30856
distOnSphere(10, 10, 10, 11, radius)
## [1] 111.1949
distOnSphere(10, 80, 10, 81, radius)
## [1] 111.1949
The unit of all results is kilometres.
(thanks to Jannis Fröhlking)
library(sf)
## Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
x1 <- st_linestring(rbind(c(0,0),c(2,2),c(0,2),c(2,0)))
x2 <- st_polygon(list(rbind(c(3,0),c(5,2),c(3,2),c(5,0),c(3,0))))
plot(c(x1,x2), col = 2:3)
st_is_simple(x1)
## [1] FALSE
st_is_simple(x2)
## [1] FALSE
Neither geometry is simple: the linestring crosses itself, and the polygon’s ring self-intersects.
for(i in c(1,1e3,1e6,1e-2))
print(round(i * c(10.542, 0.01, 45321.789))/i)
## [1] 11 0 45322
## [1] 10.542 0.010 45321.789
## [1] 10.542 0.010 45321.789
## [1] 0 0 45300
Multiplying by i, rounding, and dividing by i rounds to the nearest multiple of 1/i: i = 1 rounds to whole numbers, i = 1e3 and i = 1e6 leave these values (which have at most three decimals) unchanged, and i = 1e-2 rounds to multiples of 100.
Voronoi diagrams have “open polygons”, areas that extend into infinity, for boundary points. These cannot be represented by simple feature geometries. st_voronoi chooses a default (square) polygon to limit the extent, which can be enlarged. Alternatively, the extent can be limited using st_intersection on its result:
library(sf)
par(mfrow = c(2,2))
set.seed(133331)
mp = st_multipoint(matrix(runif(20), 10))
plot(st_voronoi(mp), col = NA, border = 'black')
plot(mp, add = TRUE)
title("default extent")
e2 = st_polygon(list(rbind(c(-5,-5), c(5, -5), c(5,5), c(-5, 5), c(-5,-5))))
plot(st_voronoi(mp, envelope = e2), col = NA, border = 'black')
plot(mp, add = TRUE)
title("enlarged envelope")
e3 = st_polygon(list(rbind(c(0,0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
v = st_voronoi(mp) %>% st_collection_extract() # pulls POLYGONs out of GC
plot(st_intersection(v, e3), col = NA, border = 'black', axes=TRUE)
plot(mp, add = TRUE)
title("smaller, intersected envelope")
st_dimension(st_point(c(0,1,1)))
## [1] 0
st_dimension(st_linestring(rbind(c(0,1,1),c(1,1,2))))
## [1] 1
st_dimension(st_polygon(list(rbind(c(0,0,0),c(1,0,0),c(1,1,0),c(0,0,0)))))
## [1] 2
(st_dimension returns the topological dimension of the geometry: 0 for points, 1 for linestrings, 2 for polygons, irrespective of the number of coordinate dimensions they are defined in)
line_1 = st_linestring(rbind(c(0,0),c(1,0)))
line_2 = st_linestring(rbind(c(.5,0),c(.5,1)))
plot(line_1,col = "green")
plot(line_2,col = "red", add = TRUE)
st_relate(line_1, line_2)
## [,1]
## [1,] "F01FF0102"
The DE-9IM relation is F01FF0102
(the boundary of a LINESTRING is formed by its two end points)
Yes, but I would say that the set may just contain one polygon, because simple features provide no way of assigning points on the boundary of two adjacent polygons to a single polygon.
# read data
nc <- st_read(system.file("shape/nc.shp", package="sf"))
## Reading layer `nc' from data source
## `/home/edzer/R/x86_64-pc-linux-gnu-library/4.0/sf/shape/nc.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 100 features and 14 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## Geodetic CRS: NAD27
# get intersections
(nc_geom = st_geometry(nc))
## Geometry set for 100 features
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## Geodetic CRS: NAD27
## First 5 geometries:
## MULTIPOLYGON (((-81.47276 36.23436, -81.54084 3...
## MULTIPOLYGON (((-81.23989 36.36536, -81.24069 3...
## MULTIPOLYGON (((-80.45634 36.24256, -80.47639 3...
## MULTIPOLYGON (((-76.00897 36.3196, -76.01735 36...
## MULTIPOLYGON (((-77.21767 36.24098, -77.23461 3...
nc_ints = st_intersection(nc_geom)
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
plot(nc_ints, main = "All intersections")
# Function to check class of intersection objects
get_points = function(x){
if(class(x)[2]=="POINT") return(x)
}
# get points
points = lapply(nc_ints, get_points)
points[sapply(points,is.null)] <- NULL
sf_points = st_sfc(points)
st_crs(sf_points) = st_crs(nc)
# get points with four neighbouring geometries (= counties)
touch = st_touches(sf_points, nc_geom)
four_n = sapply(touch, function(y) which(length(y)==4))
names(four_n) = seq_along(four_n)
point_no = array(as.numeric(names(unlist(four_n))))
result = st_sfc(points[point_no])
plot(nc_geom, main = "Points touched by four counties")
plot(result, add = TRUE, col = "red", pch = 10, cex = 2)
A more compact way might be to search for points where a county touches another county in only a single point; these can be found using st_relate with a pattern:
(pts = nc %>% st_relate(pattern = "****0****"))
## although coordinates are longitude/latitude, st_relate_pattern assumes that they are planar
## Sparse geometry binary predicate list of length 100, where the
## predicate was `relate_pattern'
## first 10 elements:
## 1: (empty)
## 2: (empty)
## 3: (empty)
## 4: (empty)
## 5: (empty)
## 6: (empty)
## 7: (empty)
## 8: (empty)
## 9: 31
## 10: 26
nc %>% st_relate(pattern = "****0****") %>% lengths() %>% sum()
## although coordinates are longitude/latitude, st_relate_pattern assumes that they are planar
## [1] 28
which is, as expected, four times the number of points shown in the plot above.
How can we find these points? See here:
nc = st_geometry(nc)
s2 = sf_use_s2(FALSE) # use GEOS (planar) geometry
## Spherical geometry (s2) switched off
pts = st_intersection(nc, nc)
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
pts = pts[st_dimension(pts) == 0]
plot(st_geometry(nc))
plot(st_geometry(pts), add = TRUE, col = "red", pch = 10, cex = 2)
sf_use_s2(s2) # set back
## Spherical geometry (s2) switched on
Only cells that were fully crossed by the red line would be grey:
library(stars)
## Loading required package: abind
ls = st_sf(a = 2, st_sfc(st_linestring(rbind(c(0.1, 0), c(1, .9)))))
grd = st_as_stars(st_bbox(ls), nx = 10, ny = 10, xlim = c(0, 1.0), ylim = c(0, 1),
values = -1)
attr(grd, "dimensions")$y$delta = .1
attr(grd, "dimensions")$y$offset = 0
r = st_rasterize(ls, grd, options = "ALL_TOUCHED=TRUE")
r[r == -1] = NA
plot(st_geometry(st_as_sf(grd)), border = 'orange', col = NA,
reset = FALSE, key.pos=NULL)
plot(r, axes = TRUE, add = TRUE, breaks = "equal")
plot(ls, add = TRUE, col = "red", lwd = 2)
The reason is that in this case, lower left corners of grid cells are part of the cell, rather than upper left corners.
How does the GeoJSON format define “straight” lines between ellipsoidal coordinates (section 3.1.1)? Using this definition of straight, what would LINESTRING(0 85,180 85) look like in a polar projection? How could this geometry be modified to have it cross the North Pole?
GeoJSON defines straight lines between pairs of ellipsoidal coordinates as the straight line in Cartesian space formed by longitude and latitude. This means e.g. that all parallels are straight lines.
Using this definition of straight, what would LINESTRING(0 85,180 85) look like in a polar projection?
Like a half circle:
library(sf)
l <- st_as_sfc("LINESTRING(0 85,180 85)") %>%
st_segmentize(1) %>%
st_set_crs('EPSG:4326')
plot(st_transform(l, 'EPSG:3995'), col = 'red', lwd = 2,
graticule = TRUE, axes = TRUE, reset = FALSE)
How could this geometry be modified to have it cross the North Pole?
One would have to let it pass through (0 90) and (180 90):
library(sf)
l <- st_as_sfc("LINESTRING(0 85,0 90,180 90,180 85)") %>%
st_segmentize(1) %>%
st_set_crs('EPSG:4326')
plot(st_transform(l, 'EPSG:3995'), col = 'red', lwd = 2,
graticule = TRUE, axes = TRUE, reset = FALSE)
Ring direction (clockwise, CW; counter clockwise, CCW) is unambiguous on \(R^2\) but not on \(S^2\): on \(S^2\) every polygon divides the sphere’s surface in two parts. When the inside of the polygon is taken as the area to the left when traversing the polygon’s points, ring direction is CCW if the area of the polygon is smaller than half the area of the sphere. For polygons dividing the sphere in two equal parts (great circles such as the equator or meridians) ring direction is ambiguous.
Bounding caps may be more compact (have a smaller area compared to the bounding box corresponding to the same geometries), they need fewer parameters, and they are invariant under rotation of (the origins of) longitude and latitude.
For areas covering one of the poles, a bounding box will always need a longitude range spanning from -180 to 180, irrespective of whether the geometry is centered around the pole.
Because that is the closest approximation of the geometry on \(R^2\).
For rnaturalearth::ne_countries(country = "Fiji", returnclass="sf"), check whether the geometry is valid on \(R^2\), on an orthographic projection centered on the country, and on \(S^2\). How can the geometry be made valid on \(S^2\)? Plot the resulting geometry back on \(R^2\). Compare the centroid of the country, as computed on \(R^2\) and on \(S^2\), and the distance between the two.
Valid on \(R^2\):
fi = rnaturalearth::ne_countries(country = "Fiji", returnclass="sf") %>%
st_geometry()
s2 = sf_use_s2(FALSE)
## Spherical geometry (s2) switched off
st_is_valid(fi)
## [1] TRUE
Valid on orthographic projection:
ortho = "+proj=ortho +lon_0=178.6 +lat_0=-17.3"
st_transform(fi, ortho) %>% st_is_valid()
## [1] FALSE
plot(st_transform(fi, ortho), border = 'red')
The red line following the antimeridian makes the geometry invalid in this projection, and also on \(S^2\):
sf_use_s2(TRUE)
## Spherical geometry (s2) switched on
st_is_valid(fi)
## [1] FALSE
Make valid on \(S^2\), and plot:
fi.s2 = st_make_valid(fi)
st_is_valid(fi.s2)
## [1] TRUE
plot(st_transform(fi.s2, ortho), border = 'red')
title("valid")
where we see that the line at the antimeridian has disappeared. This makes plotting in \(R^2\) look terrible, with lines spanning the globe:
plot(fi.s2, axes = TRUE)
Compare the centroid of the country, as computed on \(R^2\) and on \(S^2\), and the distance between the two.
sf_use_s2(FALSE)
## Spherical geometry (s2) switched off
(c1 = st_centroid(fi))
## Warning in st_centroid.sfc(fi): st_centroid does not give correct centroids for
## longitude/latitude data
## Geometry set for 1 feature
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 163.8532 ymin: -17.31631 xmax: 163.8532 ymax: -17.31631
## CRS: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## POINT (163.8532 -17.31631)
sf_use_s2(TRUE)
## Spherical geometry (s2) switched on
(c2 = st_centroid(fi.s2))
## Geometry set for 1 feature
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 178.5684 ymin: -17.31562 xmax: 178.5684 ymax: -17.31562
## CRS: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## POINT (178.5684 -17.31562)
st_distance(c1, c2)
## Units: [m]
## [,1]
## [1,] 1561723
sf_use_s2(s2)
State: The appropriate value would be constant: there is no identity relationship of State to one of the counties in nc, and the value of State is constant throughout each county in the state (every point in every county in the state has this value for State).
State for the entire state: Now the unioned geometry is that of the state, and we can assign identity: there is only one state of North Carolina, and this geometry is its geometry.
AREA variable: The nc dataset is rather old, and did not come with an extensive report on how, in detail, certain variables such as AREA were derived, so some detective work is needed here. How did people do this, more than three decades ago?
We can now compute area by
library(sf)
nc = read_sf(system.file("gpkg/nc.gpkg", package="sf"))
nc$AREA[1:10]
## [1] 0.114 0.061 0.143 0.070 0.153 0.097 0.062 0.091 0.118 0.124
s2 = sf_use_s2(FALSE) # s2 off: ellipsoidal computation (lwgeom)
## Spherical geometry (s2) switched off
nc$area = a_ell = st_area(nc)
nc$area[1:10]
## Units: [m^2]
## [1] 1137388604 611077263 1423489919 694546292 1520740530 967727952
## [7] 615942210 903650119 1179347051 1232769242
sf_use_s2(TRUE) # s2 on: spherical computation
## Spherical geometry (s2) switched on
nc$area = a_sph = st_area(nc)
nc$area[1:10]
## Units: [m^2]
## [1] 1137107793 610916077 1423145355 694378925 1520366979 967504822
## [7] 615794941 903423919 1179065710 1232475139
sf_use_s2(s2) # set back to original
cor(a_ell, a_sph)
## [1] 0.9999999
and this gives the area, in square metres, computed using either ellipsoidal or spherical geometry. We see that these are not identical, but nearly perfectly linearly correlated.
A first hypothesis might be a constant factor between the area and AREA variables. For this, we could try a power of 10:
nc$area2 = units::drop_units(nc$area / 1e10)
cor(nc$AREA, nc$area2)
## [1] 0.9998116
summary(lm(area2 ~ AREA, nc))
##
## Call:
## lm(formula = area2 ~ AREA, data = nc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.281e-03 -6.279e-04 6.328e-05 5.495e-04 2.746e-03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0009781 0.0002692 -3.633 0.000448 ***
## AREA 1.0138124 0.0019882 509.920 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0009733 on 98 degrees of freedom
## Multiple R-squared: 0.9996, Adjusted R-squared: 0.9996
## F-statistic: 2.6e+05 on 1 and 98 DF, p-value: < 2.2e-16
plot(area2 ~ AREA, nc)
abline(0, 1)
and we see a pretty good, close to 1:1 correspondence! But the factor 1e10 is strange: it does not convert square metres into a usual unit for area, whether metric or imperial.
Also, there are deviations from the 1:1 regression line. Could these be explained by the rounding of AREA to three digits? If rounding to three digits was the only cause of spread around the regression line, we would expect a residual standard error similar to the standard deviation of a uniform distribution with width .001, which is
sqrt(0.001^2/12)
## [1] 0.0002886751
but the one obtained in the regression is three times larger. Also, the unit of AREA would be 1e10 \(m^2\), or 1e4 \(km^2\), which is odd and could ring some bells: one degree of latitude corresponds roughly to 111 km, so one “square degree” at the equator corresponds roughly to \(1.11^2 \times 10^4\) \(km^2\), and at 35 degrees North roughly to
111 ^ 2 * cos(35 / 180 * pi)
## [1] 10092.77
which closely corresponds to the regression slope found above.
We can compute “square degree” area by using the \(R^2\) area routines, e.g. obtained when we set the CRS to NA:
nc2 = nc
st_crs(nc2) = NA
nc2$area = st_area(nc2) # "square degrees"
plot(area ~ AREA, nc2)
abline(0,1)
cor(nc2$area, nc2$AREA)
## [1] 0.999983
summary(lm(area ~ AREA, nc2))
##
## Call:
## lm(formula = area ~ AREA, data = nc2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.471e-04 -2.265e-04 -9.880e-06 2.714e-04 4.594e-04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.436e-05 7.965e-05 0.934 0.353
## AREA 9.996e-01 5.882e-04 1699.395 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0002879 on 98 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.888e+06 on 1 and 98 DF, p-value: < 2.2e-16
We now get a much better fit, a near perfect correlation, and a regression standard error that corresponds exactly to what one would expect after rounding AREA to three digits.
A further “red flag” against the constant (1e10) conversion hypothesis is the spatial pattern of the regression residuals obtained by the first approach:
nc$resid = residuals(lm(area2 ~ AREA, nc))
plot(nc["resid"])
These residuals clearly show a North-South trend, corresponding to the effect of ignoring the Earth’s curvature in the computation of AREA (ellipsoidal coordinates were treated as if they were Cartesian): “square degrees” become smaller when going north.
The “unit” of the AREA variable is hence “square degree”. This is a meaningless unit for area on the sphere, because a unit square degree does not have a constant area.
area: “area” is of type aggregate: it is a property of a polygon as a whole, not of each individual point in the polygon. It is extensive: if we cut a polygon in two parts, the total area is distributed over the parts.
From the on-line version of the book we get the code that created the plot:
g = st_make_grid(st_bbox(st_as_sfc("LINESTRING(0 0,1 1)")), n = c(2,2))
par(mar = rep(0,4))
plot(g)
plot(g[1] * diag(c(3/4, 1)) + c(0.25, 0.125), add = TRUE, lty = 2)
text(c(.2, .8, .2, .8), c(.2, .2, .8, .8), c(1,2,4,8), col = 'red')
A question is how we can make g into an sf object with the right attribute values associated with the right geometries. We try values 1:4:
sf = st_sf(x = 1:4, geom = g)
plot(sf)
and see the order of the geometries: row-wise, bottom row first, so
sf = st_sf(x = c(1,2,4,8), geom = g)
plot(sf)
gives us the source object. We create target geometries by
dashed = g[1] * diag(c(3/4, 1)) + c(0.25, 0.125)
box = st_union(g)
c(dashed, box)
## Geometry set for 2 features
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 0 ymin: 0 xmax: 1 ymax: 1
## CRS: NA
## POLYGON ((0.25 0.125, 0.625 0.125, 0.625 0.625,...
## POLYGON ((1 0, 0.5 0, 0 0, 0 0.5, 0 1, 0.5 1, 1...
and can call st_interpolate_aw to compute the area-weighted interpolations:
st_interpolate_aw(sf, c(dashed, box), extensive = TRUE)
## Warning in st_interpolate_aw.sf(sf, c(dashed, box), extensive = TRUE):
## st_interpolate_aw assumes attributes are constant or uniform over areas of x
## Simple feature collection with 2 features and 1 field
## Attribute-geometry relationship: 0 constant, 1 aggregate, 0 identity
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 0 ymin: 0 xmax: 1 ymax: 1
## CRS: NA
## x geometry
## 1 1.75 POLYGON ((0.25 0.125, 0.625...
## 2 15.00 POLYGON ((1 0, 0.5 0, 0 0, ...
st_interpolate_aw(sf, c(dashed, box), extensive = FALSE)
## Warning in st_interpolate_aw.sf(sf, c(dashed, box), extensive = FALSE):
## st_interpolate_aw assumes attributes are constant or uniform over areas of x
## Simple feature collection with 2 features and 1 field
## Attribute-geometry relationship: 0 constant, 1 aggregate, 0 identity
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 0 ymin: 0 xmax: 1 ymax: 1
## CRS: NA
## x geometry
## 1 2.333333 POLYGON ((0.25 0.125, 0.625...
## 2 3.750000 POLYGON ((1 0, 0.5 0, 0 0, ...
This generates a warning, which we can get rid of by setting the agr to constant:
st_agr(sf) = "constant"
st_interpolate_aw(sf, c(dashed, box), FALSE)
## Simple feature collection with 2 features and 1 field
## Attribute-geometry relationship: 0 constant, 1 aggregate, 0 identity
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 0 ymin: 0 xmax: 1 ymax: 1
## CRS: NA
## x geometry
## 1 2.333333 POLYGON ((0.25 0.125, 0.625...
## 2 3.750000 POLYGON ((1 0, 0.5 0, 0 0, ...
Why is it difficult to represent trajectories, sequences of \((x,y,t)\) obtained by tracking moving objects, by data cubes as described in this chapter?
In a socio-economic vector data cube with variables population, life expectancy, and gross domestic product ordered by dimensions country and year, which variables have block support for the spatial dimension, and which have block support for the temporal dimension?
The Sentinel-2 satellites collect images in 12 spectral bands; list advantages and disadvantages of representing them as (i) different data cubes, (ii) a data cube with 12 attributes, one for each band, and (iii) a single attribute data cube with a spectral dimension.
Explain why a curvilinear raster as shown in figure 1.5 can be considered a special case of a data cube.
Explain how the following problems can be solved with data cube operations filter, apply, reduce and/or aggregate, and in which order. Also mention for each which function is applied, and what the dimensionality of the resulting data cube is (if any):
from hourly \(PM_{10}\) measurements for a set of air quality monitoring stations, compute per station the number of days per year on which the average daily \(PM_{10}\) value exceeds 50 \(\mu g/m^3\)
Aggregate the hourly values to daily means, apply the threshold comparison, and aggregate the resulting daily exceedances to counts per year. This gives a one-dimensional data cube, with dimension “station”.
for a sequence of aerial images of an oil spill, find the time at which the oil spill had its largest extent, and the corresponding extent
Reduce the two spatial dimensions of each image to the spill extent (the area classified as oil), then reduce the time dimension by taking the maximum extent and the time at which it occurs. This gives a zero-dimensional data cube (scalars: the largest extent and its time).
from a 10-year period with global daily sea surface temperature (SST) raster maps, find the area with the 10% largest and 10% smallest temporal trends in SST values.
Apply a trend function (e.g. the slope of a regression of SST on time, fit with lm) to each pixel’s time series, reducing the time dimension, then classify the result by its 10% and 90% quantiles. This gives a two-dimensional data cube (or raster layer: the reclassified trend raster).
Find the names of the nc counties that intersect LINESTRING(-84 35,-78 35); first use [ for this, then use st_join().
(file = system.file("gpkg/nc.gpkg", package="sf"))
## [1] "/home/edzer/R/x86_64-pc-linux-gnu-library/4.0/sf/gpkg/nc.gpkg"
nc = st_read(file)
## Reading layer `nc.gpkg' from data source
## `/home/edzer/R/x86_64-pc-linux-gnu-library/4.0/sf/gpkg/nc.gpkg'
## using driver `GPKG'
## Simple feature collection with 100 features and 14 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## Geodetic CRS: NAD27
line = st_as_sfc("LINESTRING(-84 35,-78 35)", crs = st_crs(nc))
nc[line,]$NAME
## [1] "Jackson" "Mecklenburg" "Macon" "Sampson" "Cherokee"
## [6] "Cumberland" "Union" "Anson" "Hoke" "Duplin"
## [11] "Richmond" "Clay" "Scotland"
st_join(st_sf(line), nc)$NAME # left join: `line` should be first argument
## [1] "Jackson" "Mecklenburg" "Macon" "Sampson" "Cherokee"
## [6] "Cumberland" "Union" "Anson" "Hoke" "Duplin"
## [11] "Richmond" "Clay" "Scotland"
Repeat this after setting sf_use_s2(FALSE); compute the difference (hint: use setdiff()), and color the counties of the difference using color ‘#00880088’.
# save names first:
sf_use_s2(TRUE)
names_with_s2 = nc[line,]$NAME
sf_use_s2(FALSE)
## Spherical geometry (s2) switched off
nc[line,]$NAME
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
## [1] "Macon" "Sampson" "Cherokee" "Cumberland" "Union"
## [6] "Anson" "Hoke" "Duplin" "Richmond" "Clay"
## [11] "Scotland"
(diff = setdiff(names_with_s2, nc[line,]$NAME))
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
## [1] "Jackson" "Mecklenburg"
plot(st_geometry(nc))
plot(st_geometry(nc)[nc$NAME %in% diff], col = "#00880088", add = TRUE)
Plot the two different lines in a single plot; note that R always plots a straight line straight in the projection currently used; st_segmentize can be used to add points on a straight line, or on a great circle for ellipsoidal coordinates.
plot(st_geometry(nc))
plot(st_geometry(nc)[nc$NAME %in% diff], col = "#00880088", add = TRUE)
plot(line, add = TRUE)
plot(st_segmentize(line, units::set_units(10, km)), add = TRUE, col = 'red')
To show that the red line is curved, but only in plate carrée, and not e.g. in an orthographic projection centered at this region, we can also plot it in an orthographic projection:
l.gc = st_segmentize(line, units::set_units(10, km))
l.pc = st_segmentize(st_set_crs(line, NA), 0.1) %>% st_set_crs(st_crs(l.gc))
o = st_crs("+proj=ortho +lon_0=-80 +lat_0=35")
plot(st_transform(st_geometry(nc), o), axes = TRUE)
plot(st_transform(st_geometry(nc), o)[nc$NAME %in% diff],
col = "#00880088", add = TRUE)
plot(st_transform(l.gc, o), col = 'red', add = TRUE)
plot(st_transform(l.pc, o), col = 'black', add = TRUE)
plot(st_transform(line, o), col = 'green', add = TRUE)
The fact that the unsegmented object line is plotted straight by R (it contains only its two end points) and covers the red line supports that in this plot the great circle line (red) is rendered straight, while the “straight in plate carrée” line (black) is not.
NDVI, normalized difference vegetation index, is computed as (NIR-R)/(NIR+R), with NIR the near infrared and R the red band. Read the L7_ETMs.tif file into object x, and distribute the band dimensions over attributes by split(x, "band"). Then, add attribute NDVI to this object by using an expression that uses the NIR (band 4) and R (band 3) attributes directly.
library(stars)
(x = read_stars(system.file("tif/L7_ETMs.tif", package = "stars")))
## stars object with 3 dimensions and 1 attribute
## attribute(s):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## L7_ETMs.tif 1 54 69 68.91242 86 255
## dimension(s):
## from to offset delta refsys point values x/y
## x 1 349 288776 28.5 UTM Zone 25, Southern Hem... FALSE NULL [x]
## y 1 352 9120761 -28.5 UTM Zone 25, Southern Hem... FALSE NULL [y]
## band 1 6 NA NA NA NA NULL
(x.spl = split(x))
## stars object with 2 dimensions and 6 attributes
## attribute(s):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## X1 47 67 78 79.14772 89 255
## X2 32 55 66 67.57465 79 255
## X3 21 49 63 64.35886 77 255
## X4 9 52 63 59.23541 75 255
## X5 1 63 89 83.18266 112 255
## X6 1 32 60 59.97521 88 255
## dimension(s):
## from to offset delta refsys point values x/y
## x 1 349 288776 28.5 UTM Zone 25, Southern Hem... FALSE NULL [x]
## y 1 352 9120761 -28.5 UTM Zone 25, Southern Hem... FALSE NULL [y]
x.spl$NDVI = (x.spl$X4 - x.spl$X3)/(x.spl$X4 + x.spl$X3)
plot(x.spl["NDVI"])
Compute NDVI for the L7_ETMs.tif image by reducing the band dimension, using st_apply and a function ndvi = function(x) { (x[4]-x[3])/(x[4]+x[3]) }. Plot the result, and write the result to a GeoTIFF.
ndvi_fn = function(x) { (x[4]-x[3])/(x[4]+x[3]) }
ndvi = st_apply(x, 1:2, ndvi_fn)
plot(ndvi)
write_stars(ndvi, "ndvi.tif")
An alternative function is
ndvi_fn = function(x1,x2,x3,x4,x5,x6) { (x4-x3)/(x4+x3) }
ndvi2 = st_apply(x, 1:2, ndvi_fn)
all.equal(ndvi, ndvi2)
## [1] TRUE
This latter function can be much faster, as it is called for chunks of data rather than for individual pixels.
Use st_transform to transform the stars object read from L7_ETMs.tif to EPSG:4326. Print the object. Is this a regular grid? Plot the first band using arguments axes=TRUE and explain why this takes such a long time.
(x_t = st_transform(x, 'EPSG:4326'))
plot(x_t[,,,1], axes = TRUE)
The printed object reports a curvilinear grid. Plotting takes so long because for curvilinear grids, each cell is converted to a small polygon and then plotted.
Use st_warp to warp the L7_ETMs.tif object to EPSG:4326, and plot the resulting object with axes=TRUE. Why is the plot created much faster than after st_transform?
x_w = st_warp(x, crs = 'EPSG:4326')
plot(x_w[,,,1], reset = FALSE)
plot(st_as_sfc(st_bbox(x_w)), col = NA, border = 'red', add = TRUE)
Plotting is faster now because we created a new regular grid. Note that the grid border does not align perfectly with the square formed by the bounding box (using straight lines in an equidistant rectangular projection): white grid cells indicate the misalignment due to warping/transforming.
Using a vector representation of the raster L7_ETMs, plot the intersection with a circular area around POINT(293716 9113692) with radius 75 m, and compute the area-weighted mean pixel values for this circle. Compare the area-weighted values with those obtained by aggregate using the vector data, and by aggregate using the raster data, using exact=FALSE (default) and exact=TRUE. Explain the differences.
l7 = st_as_sf(x)
st_agr(l7) = "constant"
a = st_as_sfc("POINT(293716 9113692)", crs = st_crs(l7)) %>%
st_buffer(units::set_units(74, m))
plot(st_intersection(l7, a))
(aw = st_interpolate_aw(l7, a, mean, extensive = FALSE))
## Simple feature collection with 1 feature and 6 fields
## Attribute-geometry relationship: 0 constant, 6 aggregate, 0 identity
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 293642 ymin: 9113618 xmax: 293790 ymax: 9113766
## Projected CRS: UTM Zone 25, Southern Hemisphere
## L7_ETMs.tif.V1 L7_ETMs.tif.V2 L7_ETMs.tif.V3 L7_ETMs.tif.V4 L7_ETMs.tif.V5
## 1 88.54362 73.50115 80.69708 49.98497 111.8357
## L7_ETMs.tif.V6 geometry
## 1 97.56707 POLYGON ((293790 9113692, 2...
(ag_vector = aggregate(l7, a, mean))
## Simple feature collection with 1 feature and 6 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 293642 ymin: 9113618 xmax: 293790 ymax: 9113766
## Projected CRS: UTM Zone 25, Southern Hemisphere
## L7_ETMs.tif.V1 L7_ETMs.tif.V2 L7_ETMs.tif.V3 L7_ETMs.tif.V4 L7_ETMs.tif.V5
## 1 88.0303 72.84848 79.30303 50.33333 111.9697
## L7_ETMs.tif.V6 geometry
## 1 96.72727 POLYGON ((293790 9113692, 2...
(ag_rasterF = st_as_sf(aggregate(x, a, mean)))
## Simple feature collection with 1 feature and 6 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 293642 ymin: 9113618 xmax: 293790 ymax: 9113766
## Projected CRS: UTM Zone 25, Southern Hemisphere
## L7_ETMs.tif.V1 L7_ETMs.tif.V2 L7_ETMs.tif.V3 L7_ETMs.tif.V4 L7_ETMs.tif.V5
## 1 88.6 73.55 81.1 49.55 112.1
## L7_ETMs.tif.V6 geometry
## 1 98.4 POLYGON ((293790 9113692, 2...
(ag_rasterT = st_as_sf(aggregate(x, a, mean, exact = TRUE)))
## Simple feature collection with 1 feature and 6 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 293642 ymin: 9113618 xmax: 293790 ymax: 9113766
## Projected CRS: UTM Zone 25, Southern Hemisphere
## L7_ETMs.tif.V1 L7_ETMs.tif.V2 L7_ETMs.tif.V3 L7_ETMs.tif.V4 L7_ETMs.tif.V5
## 1 88.54362 73.50115 80.69708 49.98497 111.8357
## L7_ETMs.tif.V6 sfc
## 1 97.56707 POLYGON ((293790 9113692, 2...
rbind(st_drop_geometry(aw),
st_drop_geometry(ag_vector),
st_drop_geometry(ag_rasterF),
st_drop_geometry(ag_rasterT))
## L7_ETMs.tif.V1 L7_ETMs.tif.V2 L7_ETMs.tif.V3 L7_ETMs.tif.V4 L7_ETMs.tif.V5
## 1 88.54362 73.50115 80.69708 49.98497 111.8357
## 11 88.03030 72.84848 79.30303 50.33333 111.9697
## 12 88.60000 73.55000 81.10000 49.55000 112.1000
## 13 88.54362 73.50115 80.69708 49.98497 111.8357
## L7_ETMs.tif.V6
## 1 97.56707
## 11 96.72727
## 12 98.40000
## 13 97.56707
Area-weighted interpolation computes the area-weighted mean of the areas shown in the plot; aggregate on the vector values computes the unweighted mean over all polygonized pixels that intersect with the circle (black lines); aggregate on the raster values only averages (unweighted) the cells with pixel centers intersecting with the circle (light red):
plot(st_geometry(l7)[a])
plot(a, add = TRUE, col = NA, border = 'red')
plot(st_as_sf(x[a])[1], add = TRUE, col = '#ff000066')
plot(st_as_sf(x[a], as_points = TRUE)[1], add = TRUE, pch = 3, col = 1)
For the S2 image (above), find out in which order the bands are stored using st_get_dimension_values(), and try to find out (e.g. by internet search) which spectral bands / colors they correspond to.
f = "sentinel/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.zip"
granule = system.file(f, package = "starsdata")
file.size(granule)
## [1] 769461997
base_name = strsplit(basename(granule), ".zip")[[1]]
s2 = paste0("SENTINEL2_L1C:/vsizip/", granule, "/", base_name,
".SAFE/MTD_MSIL1C.xml:10m:EPSG_32632")
library(stars)
(p = read_stars(s2, proxy = TRUE))
## stars_proxy object with 1 attribute in 1 file(s):
## $`MTD_MSIL1C.xml:10m:EPSG_32632`
## [1] "[...]/MTD_MSIL1C.xml:10m:EPSG_32632"
##
## dimension(s):
## from to offset delta refsys point values x/y
## x 1 10980 3e+05 10 WGS 84 / UTM zone 32N NA NULL [x]
## y 1 10980 6e+06 -10 WGS 84 / UTM zone 32N NA NULL [y]
## band 1 4 NA NA NA NA B4,...,B8
st_get_dimension_values(p, "band")
## [1] "B4" "B3" "B2" "B8"
Compute NDVI for the S2 image, using st_apply and an appropriate ndvi function. Plot the result to screen, and then write the result to a GeoTIFF. Explain the difference in runtime between plotting and writing.
ndvi_fn = function(r, g, b, nir) (nir-r)/(nir+r)
ndvi = st_apply(p, 1:2, ndvi_fn)
plot(ndvi)
## downsample set to 18
Alternatively, one could use the single-argument form
ndvi_fn = function(x) (x[4] - x[1])/(x[4] + x[1]) # band order here is B4, B3, B2, B8
but that is much less efficient, as it is called for individual pixels rather than for chunks of data. Write to a tiff:
system.time(write_stars(ndvi, "ndvi.tif"))
## ================================================================================
## user system elapsed
## 236.838 8.164 63.262
The runtime difference is caused by the fact that plot downsamples, so it computes only a very small fraction of the available pixels, whereas write_stars computes all pixels and then writes them.
Plot an RGB composite of the S2 image, using the rgb argument to plot(), and then by using st_rgb() first.
plot(p, rgb = 1:3)
## downsample set to 18
# plot(st_rgb(p[,,,1:3], maxColorValue=13600)) # FIXME: fails
Select five random points from the bounding box of S2, and extract the band values at these points. What is the class of the object returned? Convert the object returned to an sf object.
pts = p %>% st_bbox() %>% st_as_sfc() %>% st_sample(5)
(p5 = st_extract(p, pts))
## stars object with 2 dimensions and 1 attribute
## attribute(s):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## MTD_MSIL1C.xml:10m:EPSG_32632 0 1699.25 2209.5 2021.95 2837.5 3648
## dimension(s):
## from to offset delta refsys point
## geometry 1 5 NA NA WGS 84 / UTM zone 32N TRUE
## band 1 4 NA NA NA NA
## values
## geometry POINT (344398.1 5986860),...,POINT (318560.4 5980671)
## band B4,...,B8
class(p5)
## [1] "stars"
st_as_sf(p5)
## Simple feature collection with 5 features and 4 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 318560.4 ymin: 5911044 xmax: 407394.7 ymax: 5996965
## Projected CRS: WGS 84 / UTM zone 32N
## B4 B3 B2 B8 geometry
## 1 3099 3165 3648 3088 POINT (344398.1 5986860)
## 2 0 0 0 0 POINT (407394.7 5911044)
## 3 2220 2199 2526 1973 POINT (370068.3 5947493)
## 4 1947 1806 2030 1379 POINT (404787.1 5996965)
## 5 2477 2731 3397 2754 POINT (318560.4 5980671)
For the 10 km radius circle around POINT(390000 5940000), compute the mean pixel values of the S2 image when downsampling the images with factor 30, and on the original resolution. Compute the relative difference between the results.
b = st_buffer(st_sfc(st_point(c(390000, 5940000)), crs = st_crs(p)),
units::set_units(10, km))
plot(p[,,,1], reset = FALSE, axes = TRUE)
## downsample set to 18
plot(b, col = NA, border = 'green', add = TRUE)
p1 = st_as_stars(p, downsample = 30)
a1 = aggregate(p1, b, mean)
For the full resolution, this takes a while:
system.time(a2 <- aggregate(p, b, mean))
## Warning in c.stars(structure(list(MTD_MSIL1C.xml.10m.EPSG_32632 =
## structure(c(1032.86211707414, : along argument ignored; maybe you wanted to use
## st_redimension?
## user system elapsed
## 88.007 1.305 80.785
Relative differences: we will work on the array of the stars objects:
(a1[[1]] - a2[[1]])/((a1[[1]]+a2[[1]])/2)
## [,1] [,2] [,3] [,4]
## [1,] 0.001055567 0.0009431265 0.001103254 7.781836e-05
Alternatively one could convert a1 and a2 to a data.frame, using as.data.frame, and work on the third column of the data frames.
Use hist to compute the histogram of the downsampled S2 image. Also do this for each of the bands. Use ggplot2 to create a single plot with all four histograms in facets.
hist(p1)
hist(p1[,,,1])
hist(p1[,,,2])
hist(p1[,,,3])
hist(p1[,,,4])
library(ggplot2)
ggplot(as.data.frame(p1), aes(x = MTD_MSIL1C.xml.10m.EPSG_32632)) +
geom_histogram() + facet_wrap(~band)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Use st_crop to crop the S2 image to the area covered by the 10 km circle. Plot the results. Explore the effect of setting argument crop = FALSE.
plot(st_crop(p, b))
## downsample set to 2
plot(st_crop(p, b, crop = FALSE))
## downsample set to 18
With the downsampled image, compute the logical layer where all four bands have pixel values higher than 1000. Use a raster algebra expression on the four bands (use split first), or use st_apply for this.
p_spl = split(p1)
p_spl$high = p_spl$B4 > 1000 & p_spl$B3 > 1000 & p_spl$B2 > 1000 & p_spl$B8 > 1000
plot(p_spl["high"])
Alternatively, using st_apply on the band dimension:
p2 = st_apply(p1, 1:2, function(x) all(x > 1000))
plot(p2)
If dimensionality (point/line/polygon) varies in the data set, geometries must be reduced to the lowest dimension present (usually points). If all the observations are polygonal (polygon or multipolygon), contiguities (shared boundaries) are a sparse and robust neighbour representation (spdep::poly2nb()). Polygons may also be reduced to points by taking for example centroids, but neighbours found by triangulating points may not be the same as contiguity neighbours for the polygons represented by these centroids (spdep::tri2nb()). If the geometries are multipoint, they must also be reduced to a single point. If the geometries have point rather than areal support, for example real estate transaction data, k-nearest neighbour (spdep::knn2nb(spdep::knearneigh())), graph-based (spdep::graph2nb() applied to the output of spdep::soi.graph(), spdep::relativeneigh() or spdep::gabrielneigh()) and distance-based (spdep::dnearneigh()) methods may be used.
Graph-based functions for creating neighbour objects (spdep::soi.graph(), spdep::relativeneigh() and spdep::gabrielneigh()) may not be used if the support of the observations is not that of points on the plane. All other functions may be used with both planar and spherical/elliptical geometries, but the neighbours generated may differ if a non-planar data set is treated as planar.
A chessboard is an \(8 \times 8\) grid:
xy <- data.frame(expand.grid(1:8, 1:8),
                 col = rep(c(rep(c("black", "white"), 4), rep(c("white", "black"), 4)), 4))
library(stars)
library(sf)
(xy %>% st_as_stars() %>% st_as_sf() -> grd)
## Simple feature collection with 64 features and 1 field
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 0.5 ymin: 0.5 xmax: 8.5 ymax: 8.5
## CRS: NA
## First 10 features:
## col geometry
## 1 white POLYGON ((0.5 8.5, 1.5 8.5,...
## 2 black POLYGON ((1.5 8.5, 2.5 8.5,...
## 3 white POLYGON ((2.5 8.5, 3.5 8.5,...
## 4 black POLYGON ((3.5 8.5, 4.5 8.5,...
## 5 white POLYGON ((4.5 8.5, 5.5 8.5,...
## 6 black POLYGON ((5.5 8.5, 6.5 8.5,...
## 7 white POLYGON ((6.5 8.5, 7.5 8.5,...
## 8 black POLYGON ((7.5 8.5, 8.5 8.5,...
## 9 black POLYGON ((0.5 7.5, 1.5 7.5,...
## 10 white POLYGON ((1.5 7.5, 2.5 7.5,...
library(spdep)
## Loading required package: sp
## Loading required package: spData
(rook <- poly2nb(grd, queen=FALSE))
## Neighbour list object:
## Number of regions: 64
## Number of nonzero links: 224
## Percentage nonzero weights: 5.46875
## Average number of links: 3.5
The rook neighbours also form a grid, where the neighbours share a grid edge:
plot(st_geometry(grd), col=grd$col)
plot(rook, xy, add=TRUE, col="grey")
(queen <- poly2nb(grd, queen=TRUE))
## Neighbour list object:
## Number of regions: 64
## Number of nonzero links: 420
## Percentage nonzero weights: 10.25391
## Average number of links: 6.5625
The queen neighbours add neighbours sharing only one corner point:
plot(st_geometry(grd), col=grd$col)
plot(queen, xy, add=TRUE, col="grey")
and the difference yields neighbours sharing not more than one boundary point:
plot(st_geometry(grd), col=grd$col)
plot(diffnb(queen, rook), xy, add=TRUE, col="grey")
We can access cardinalities using card(), and tabulate their frequencies for the chessboard rook case:
((rook %>% card() -> rc) %>% table() -> t)
## .
## 2 3 4
## 4 24 36
Taking the counts found, we can construct the weights corresponding to those neighbour counts:
1/rev(as.numeric(names(t)))
## [1] 0.2500000 0.3333333 0.5000000
Plotting the row-standardized weights, we see that they up-weight the neighbours of observations with few neighbours, and down-weight the neighbours of observations with more neighbours:
grd$rc <- as.factor(1/rc)
plot(grd[, "rc"], main="rook row-standardized weights")
We can also use the cardinality frequency table to find counts of neighbours with (increasing) weights:
unname(rev(t))*rev(as.numeric(names(t)))
## [1] 144 72 8
This can be confirmed by tabulating the frequencies of weights yielded by nb2listw():
table(unlist(nb2listw(rook, style="W")$weights))
##
## 0.25 0.333333333333333 0.5
## 144 72 8
Repeating for the queen case again shows how row-standardization can engender edge effects:
((queen %>% card() -> rc) %>% table() -> t)
## .
## 3 5 8
## 4 24 36
1/rev(as.numeric(names(t)))
## [1] 0.1250000 0.2000000 0.3333333
grd$rc <- as.factor(1/rc)
plot(grd[, "rc"], main="rook row-standardised weights")
unname(rev(t))*rev(as.numeric(names(t)))
## [1] 288 120 12
table(unlist(nb2listw(queen, style="W")$weights))
##
## 0.125 0.2 0.333333333333333
## 288 120 12
Re-using the objects from exercise 14.3, we have:
(grd$col %>% factor() -> COL) %>% table()
## .
## black white
## 32 32
In the rook case, no black:black or white:white neighbours are found, differing greatly from the expected values, which are based on non-free sampling from the proportions of colours in the data. Highly significant spatial autocorrelation is detected:
(jc_r <- joincount.multi(COL, listw=nb2listw(rook, style="B")))
## Joincount Expected Variance z-value
## black:black 0.000 27.556 8.325 -9.5503
## white:white 0.000 27.556 8.325 -9.5503
## white:black 112.000 56.889 27.205 10.5662
## Jtot 112.000 56.889 27.205 10.5662
In the queen neighbour case, no spatial autocorrelation is found, despite a chessboard looking spatially structured:
joincount.multi(COL, listw=nb2listw(queen, style="B"))
## Joincount Expected Variance z-value
## black:black 49.000 51.667 23.616 -0.5487
## white:white 49.000 51.667 23.616 -0.5487
## white:black 112.000 106.667 47.796 0.7714
## Jtot 112.000 106.667 47.796 0.7714
This is because we have chosen to see all eight neighbour grid cells as neighbours (away from the edges of the board), so the two categories occur equally often as neighbour values, as expected.
First, create an uncorrelated variable, and confirm that it is uncorrelated:
set.seed(1)
x <- rnorm(nrow(grd))
moran.test(x, nb2listw(queen, style="W"), randomisation=FALSE, alternative="two.sided")
##
## Moran I test under normality
##
## data: x
## weights: nb2listw(queen, style = "W")
##
## Moran I statistic standard deviate = 0.27024, p-value = 0.787
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic Expectation Variance
## 0.002292497 -0.015873016 0.004518556
Next, inject patterning into the variable by adding a linear trend:
x_t <- x + (0.15 * xy$Var1)
moran.test(x_t, nb2listw(queen, style="W"), randomisation=FALSE, alternative="two.sided")
##
## Moran I test under normality
##
## data: x_t
## weights: nb2listw(queen, style = "W")
##
## Moran I statistic standard deviate = 3.0064, p-value = 0.002644
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic Expectation Variance
## 0.186218071 -0.015873016 0.004518556
Test again having taken the residuals from a linear model removing the injected trend:
lm.morantest(lm(x_t ~ xy$Var1), nb2listw(queen, style="W"), alternative="two.sided")
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = x_t ~ xy$Var1)
## weights: nb2listw(queen, style = "W")
##
## Moran I statistic standard deviate = 0.28036, p-value = 0.7792
## alternative hypothesis: two.sided
## sample estimates:
## Observed Moran I Expectation Variance
## -0.012449399 -0.030600358 0.004191544
This is important to understand because the spatial patterning in a variable of interest, and picked up by a global measure of spatial autocorrelation, may be driven by an omitted variable. If we cannot add that variable, a latent variable or mixed effects model may be a good choice.
Following the discovery that, irrespective of the number of draws, permutation conditional standard deviates (omitting the value at \(i\)) are very close to analytical conditional standard deviates for \(I_i\) and \(G_i\), it seems that the use of conditional permutation is no longer necessary, unless one wishes to check for robustness against distributional assumptions. It will be interesting to see whether the same holds for local Geary’s \(C_i\) and other local measures.
False discovery rate adjustment is required when conducting repeated tests on the same data set. Usually, local measures of spatial autocorrelation are calculated for all the observations in a data set, and so constitute repeated tests. When repeated tests are conducted, the usual reading of confidence intervals and probability values must be adjusted to take the repeated use of the data into account.
If we start with the standard local Moran’s \(I_i\) for the random values with a slight 1D trend, upgraded to analytical conditional standard deviates, but with only the intercept-only mean model, we have a starting point; a fair number of the standard deviates exceed 2 in absolute value:
locm <- localmoran(x_t, nb2listw(queen, style="W"), conditional=TRUE,
alternative="two.sided")
plot(density(locm[, 4]))
abline(v=c(-2, 2))
grd$locm_sd <- locm[, 4]
plot(grd[, "locm_sd"])
sum(p.adjust(locm[, 5], method="none") < 0.05)
## [1] 6
If we apply false discovery rate adjustment, we have just one significant measure:
sum(p.adjust(locm[, 5], method="fdr") < 0.05)
## [1] 1
In the first Saddlepoint approximation, again for the random values with a slight 1D trend but with an intercept-only mean model, the distribution of standard deviates shifts leftward, with both positive and negative values beyond abs(2):
lm_null <- lm(x_t ~ 1)
locm_null <- summary(localmoran.sad(lm_null, nb=queen, style="W",
alternative="two.sided"))
plot(density(locm_null[, "Saddlepoint"]))
abline(v=c(-2, 2))
grd$locm_null_sd <- locm_null[, "Saddlepoint"]
plot(grd[, "locm_null_sd"])
sum(p.adjust(locm_null[, "Pr. (Sad)"], method="none") < 0.05)
## [1] 8
If we apply false discovery rate adjustment, we also have just one significant measure:
sum(p.adjust(locm_null[, "Pr. (Sad)"], method="fdr") < 0.05)
## [1] 1
Once we analyse a model including the 1D trend, most of the distribution of standard deviate values is between -2 and 2:
lm_trend <- lm(x_t ~ xy$Var1)
locm_tr <- summary(localmoran.sad(lm_trend, nb=queen, style="W",
alternative="two.sided"))
plot(density(locm_tr[, "Saddlepoint"]))
abline(v=c(-2, 2))
grd$locm_tr_sd <- locm_tr[, "Saddlepoint"]
plot(grd[, "locm_tr_sd"])
sum(p.adjust(locm_tr[, "Pr. (Sad)"], method="none") < 0.05)
## [1] 2
If we apply false discovery rate adjustment, we now have no significant measures, as expected:
sum(p.adjust(locm_tr[, "Pr. (Sad)"], method="fdr") < 0.05)
## [1] 0
localmoran.sad() or localmoran.exact() provide both richer mean models, and estimates of the standard deviates built on the underlying spatial relationships for each observation, rather than analytical or permutation assumptions for the whole data set. This is achieved at the cost of longer compute times and larger memory use, especially when the Omega= argument to localmoran.sad() or localmoran.exact.alt() is used, because this is a dense \(n \times n\) matrix.
The HSAR package includes an upper level polygon support municipality department data set, and a lower level property data set. Both are "sf" objects, in the same projected CRS.
library(sf)
library(HSAR)
data(depmunic)
data(properties)
depmunic$popdens <- depmunic$population/ (10000*depmunic$area)
depmunic$foreigners <- 100 * depmunic$pop_rest/ depmunic$population
depmunic$prgreensp <- depmunic$greensp/ (10000*depmunic$area)
In the vignette, two upper-level variables are added to the six already present, and we change the green space variable scaling to avoid numerical issues in calculating coefficient standard errors.
summary(depmunic)
## num_dep airbnb museums population
## Min. :1.0 Min. : 144.0 Min. : 0.000 Min. : 45168
## 1st Qu.:2.5 1st Qu.: 377.5 1st Qu.: 0.000 1st Qu.: 78753
## Median :4.0 Median : 565.0 Median : 0.000 Median : 98283
## Mean :4.0 Mean : 712.3 Mean : 2.571 Mean : 93702
## 3rd Qu.:5.5 3rd Qu.: 675.5 3rd Qu.: 0.500 3rd Qu.:112677
## Max. :7.0 Max. :2171.0 Max. :17.000 Max. :129603
## pop_rest greensp area geometry
## Min. : 2735 Min. : 40656 Min. :3.987 POLYGON :7
## 1st Qu.: 4588 1st Qu.: 42340 1st Qu.:4.264 epsg:2100 :0
## Median : 5099 Median : 93715 Median :4.836 +proj=tmer...:0
## Mean : 7109 Mean :183796 Mean :5.420
## 3rd Qu.: 8110 3rd Qu.:294286 3rd Qu.:6.412
## Max. :16531 Max. :478951 Max. :7.764
## popdens foreigners prgreensp
## Min. :0.8039 Min. : 4.890 Min. :0.7708
## 1st Qu.:1.2979 1st Qu.: 5.058 1st Qu.:0.9653
## Median :1.8763 Median : 6.055 Median :1.2070
## Mean :1.8697 Mean : 7.369 Mean :3.3883
## 3rd Qu.:2.2807 3rd Qu.: 8.882 3rd Qu.:4.9527
## Max. :3.2508 Max. :12.755 Max. :9.9040
The properties data set has only four variables, but with price per square metre already added:
summary(properties)
## id size price prpsqm
## Length:1000 Min. : 21.00 Min. : 8000 Min. : 207.5
## Class :character 1st Qu.: 55.00 1st Qu.: 44750 1st Qu.: 631.1
## Mode :character Median : 75.00 Median : 80000 Median :1098.8
## Mean : 83.17 Mean : 123677 Mean :1316.9
## 3rd Qu.: 98.00 3rd Qu.: 147000 3rd Qu.:1814.2
## Max. :1250.00 Max. :5500000 Max. :9166.7
## age dist_metro geometry
## Min. : 0.00 Min. : 4.423 POINT :1000
## 1st Qu.:11.00 1st Qu.: 338.523 epsg:2100 : 0
## Median :42.00 Median : 537.250 +proj=tmer...: 0
## Mean :34.16 Mean : 610.772
## 3rd Qu.:48.00 3rd Qu.: 819.929
## Max. :67.00 Max. :1888.856
The values of the variables in depmunic get copied to each of the properties falling within the boundaries of the municipality departments:
properties_in_dd <- st_join(properties, depmunic, join = st_within)
For polygon support, we prefer contiguous neighbours:
(mun_nb <- spdep::poly2nb(depmunic, row.names=as.character(depmunic$num_dep)))
## Neighbour list object:
## Number of regions: 7
## Number of nonzero links: 20
## Percentage nonzero weights: 40.81633
## Average number of links: 2.857143
Global spatial autocorrelation is marginally detected for the green space variable:
spdep::moran.test(depmunic$prgreensp, nb2listw(mun_nb))
##
## Moran I test under randomisation
##
## data: depmunic$prgreensp
## weights: nb2listw(mun_nb)
##
## Moran I statistic standard deviate = 1.825, p-value = 0.034
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.22384238 -0.16666667 0.04578844
Unlike the vignette, which uses distance neighbours up to 1300 m and creates a very dense representation, we choose k=4 nearest neighbours, then impose symmetry (note that some point locations are duplicated, preventing the use of spatial indexing):
(pr_nb_k4s <- spdep::knn2nb(spdep::knearneigh(properties, k=4), sym=TRUE, row.names=properties$id))
## Warning in spdep::knearneigh(properties, k = 4): knearneigh: identical points
## found
## Neighbour list object:
## Number of regions: 1000
## Number of nonzero links: 5874
## Percentage nonzero weights: 0.5874
## Average number of links: 5.874
Copying out has led to the introduction of very strong positive spatial autocorrelation in this and other variables copied out:
spdep::moran.test(properties_in_dd$prgreensp, nb2listw(pr_nb_k4s))
##
## Moran I test under randomisation
##
## data: properties_in_dd$prgreensp
## weights: nb2listw(pr_nb_k4s)
##
## Moran I statistic standard deviate = 51.666, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.9757099996 -0.0010010010 0.0003573715
The vignette proposes the full property level and municipal department level set of variables straight away. Here we choose the property level ones first, and update with the copied out municipal department level ones next:
f_pr <- prpsqm ~ size + age + dist_metro
f_pr_md <- update(f_pr, . ~ . + foreigners + prgreensp + popdens + museums + airbnb)
Adding in the copied out upper level variables appears to account for more of the variability of the response than leaving them out:
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-36. For overview type 'help("mgcv-package")'.
pr_base <- gam(f_pr, data=properties_in_dd)
pr_2lev <- gam(f_pr_md, data=properties_in_dd)
anova(pr_base, pr_2lev, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: prpsqm ~ size + age + dist_metro
## Model 2: prpsqm ~ size + age + dist_metro + foreigners + prgreensp + popdens +
## museums + airbnb
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 996 625149227
## 2 991 524571512 5 100577716 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(pr_base)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## prpsqm ~ size + age + dist_metro
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1702.61693 78.59752 21.66 < 2e-16 ***
## size 5.18681 0.46510 11.15 < 2e-16 ***
## age -18.11938 1.33737 -13.55 < 2e-16 ***
## dist_metro -0.32446 0.07115 -4.56 5.75e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## R-sq.(adj) = 0.234 Deviance explained = 23.7%
## GCV = 6.3018e+05 Scale est. = 6.2766e+05 n = 1000
summary(pr_2lev)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## prpsqm ~ size + age + dist_metro + foreigners + prgreensp + popdens +
## museums + airbnb
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1648.17176 189.84951 8.681 <2e-16 ***
## size 4.29551 0.43386 9.901 <2e-16 ***
## age -20.18585 1.27009 -15.893 <2e-16 ***
## dist_metro -0.15180 0.07159 -2.120 0.0342 *
## foreigners -38.13473 22.54311 -1.692 0.0910 .
## prgreensp 23.90080 16.33291 1.463 0.1437
## popdens -51.82590 103.65578 -0.500 0.6172
## museums -19.06125 21.27904 -0.896 0.3706
## airbnb 0.63284 0.32887 1.924 0.0546 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## R-sq.(adj) = 0.354 Deviance explained = 36%
## GCV = 5.3414e+05 Scale est. = 5.2934e+05 n = 1000
Adding an upper level IID random effect to the base formula also improves the fit of the model substantially:
pr_base_iid <- gam(update(f_pr, . ~ . + s(num_dep, bs="re")), data=properties_in_dd)
anova(pr_base, pr_base_iid, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: prpsqm ~ size + age + dist_metro
## Model 2: prpsqm ~ size + age + dist_metro + s(num_dep, bs = "re")
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 996 625149227
## 2 995 557913777 0.99993 67235450 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(pr_base_iid)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## prpsqm ~ size + age + dist_metro + s(num_dep, bs = "re")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2244.07579 89.35422 25.114 < 2e-16 ***
## size 4.63594 0.44249 10.477 < 2e-16 ***
## age -19.12405 1.26739 -15.089 < 2e-16 ***
## dist_metro -0.18383 0.06848 -2.685 0.00738 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(num_dep) 0.9916 1 118.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.316 Deviance explained = 31.9%
## GCV = 5.6353e+05 Scale est. = 5.6071e+05 n = 1000
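The bs="re" smooth is the penalised-regression analogue of an IID Gaussian random intercept; for comparison, the same structure could be fitted with lme4 (a sketch under that assumption, not run in the original):
# sketch: an IID random intercept per municipality department, as in s(num_dep, bs="re")
library(lme4)
pr_base_lmer <- lmer(update(f_pr, . ~ . + (1 | num_dep)), data=properties_in_dd)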
This improvement is much more moderate when both the upper level variables and IID random effect are present:
pr_2lev_iid <- gam(update(f_pr_md, . ~ . + s(num_dep, bs="re")), data=properties_in_dd)
anova(pr_2lev, pr_2lev_iid, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: prpsqm ~ size + age + dist_metro + foreigners + prgreensp + popdens +
## museums + airbnb
## Model 2: prpsqm ~ size + age + dist_metro + foreigners + prgreensp + popdens +
## museums + airbnb + s(num_dep, bs = "re")
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 991.00 524571512
## 2 990.04 522144057 0.95676 2427454 0.02981 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(pr_2lev_iid)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## prpsqm ~ size + age + dist_metro + foreigners + prgreensp + popdens +
## museums + airbnb + s(num_dep, bs = "re")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1520.74633 200.41291 7.588 7.48e-14 ***
## size 4.29515 0.43303 9.919 < 2e-16 ***
## age -20.09458 1.26851 -15.841 < 2e-16 ***
## dist_metro -0.14574 0.07152 -2.038 0.04184 *
## foreigners -83.04410 32.17854 -2.581 0.01000 *
## prgreensp -68.29107 49.95934 -1.367 0.17196
## popdens 261.72640 191.05195 1.370 0.17102
## museums -125.97470 58.73993 -2.145 0.03223 *
## airbnb 1.92092 0.73695 2.607 0.00928 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(num_dep) 0.7921 1 3.811 0.0285 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.357 Deviance explained = 36.3%
## GCV = 5.3252e+05 Scale est. = 5.2731e+05 n = 1000
The "mrf" smooth term requires matching ID keys so that the neighbour object lines up with the observations: the neighbour list must be named with the region identifiers, and the grouping variable must be a factor whose levels match those names. Once these are set, the properties level model with a municipality department level MRF smooth may be fitted:
names(mun_nb) <- attr(mun_nb, "region.id")
properties_in_dd$num_dep <- factor(properties_in_dd$num_dep)
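As a defensive, illustrative check (not in the original), one can verify before fitting that every level of the grouping factor has a matching named entry in the neighbour list:
# illustrative sanity check: each num_dep level must appear among the neighbour names
stopifnot(all(levels(properties_in_dd$num_dep) %in% names(mun_nb)))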
pr_base_mrf <- gam(update(f_pr, . ~ . + s(num_dep, bs="mrf", xt=list(nb=mun_nb))),
data=properties_in_dd)
summary(pr_base_mrf)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## prpsqm ~ size + age + dist_metro + s(num_dep, bs = "mrf", xt = list(nb = mun_nb))
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1730.81146 76.44218 22.642 <2e-16 ***
## size 4.32010 0.43280 9.982 <2e-16 ***
## age -20.00432 1.26761 -15.781 <2e-16 ***
## dist_metro -0.14718 0.07142 -2.061 0.0396 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(num_dep) 5.852 6 31.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.357 Deviance explained = 36.3%
## GCV = 5.3255e+05 Scale est. = 5.2731e+05 n = 1000
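The estimated department-level MRF effects could be extracted for mapping; a sketch using predict() with type="terms" (illustrative: it holds the covariates at zero and assumes the department identifiers match the factor levels):
# sketch: one prediction row per department, covariates held at zero
nd <- data.frame(size=0, age=0, dist_metro=0,
                 num_dep=factor(names(mun_nb), levels=levels(properties_in_dd$num_dep)))
predict(pr_base_mrf, newdata=nd, type="terms")[, "s(num_dep)"]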
Repeating the fit for the extended model with the upper level variables present, we see that no more response variability is accounted for than in the MRF random effect model with only the lower level variables, and none of the upper level variables is significant at conventional levels.
pr_2lev_mrf <- gam(update(f_pr_md, . ~ . + s(num_dep, bs="mrf", xt=list(nb=mun_nb))),
data=properties_in_dd)
summary(pr_2lev_mrf)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## prpsqm ~ size + age + dist_metro + foreigners + prgreensp + popdens +
## museums + airbnb + s(num_dep, bs = "mrf", xt = list(nb = mun_nb))
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1510.58909 425.16218 3.553 0.000399 ***
## size 4.29515 0.43303 9.919 < 2e-16 ***
## age -20.09456 1.26851 -15.841 < 2e-16 ***
## dist_metro -0.14574 0.07152 -2.038 0.041844 *
## foreigners -42.18041 58.30966 -0.723 0.469613
## prgreensp 15.34415 54.84749 0.280 0.779720
## popdens -50.09688 259.59932 -0.193 0.847016
## museums -53.40432 55.25567 -0.966 0.334033
## airbnb 1.01045 0.81453 1.241 0.215072
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(num_dep) 0.7922 2 1.906 0.0285 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.357 Deviance explained = 36.3%
## GCV = 5.3252e+05 Scale est. = 5.2731e+05 n = 1000
It also seems that the model without upper level variables outperforms that with them included:
anova(pr_base_mrf, pr_2lev_mrf, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: prpsqm ~ size + age + dist_metro + s(num_dep, bs = "mrf", xt = list(nb = mun_nb))
## Model 2: prpsqm ~ size + age + dist_metro + foreigners + prgreensp + popdens +
## museums + airbnb + s(num_dep, bs = "mrf", xt = list(nb = mun_nb))
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 990.01 522113016
## 2 990.04 522143950 -0.037064 -30934 0.054 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
and the MRF RE outperforms the IID RE:
anova(pr_base_mrf, pr_base_iid, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: prpsqm ~ size + age + dist_metro + s(num_dep, bs = "mrf", xt = list(nb = mun_nb))
## Model 2: prpsqm ~ size + age + dist_metro + s(num_dep, bs = "re")
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 990.01 522113016
## 2 995.00 557913777 -4.9939 -35800762 2.786e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
First create a spatial weights object from the k=4 symmetrized neighbour object:
library(spatialreg)
## Loading required package: Matrix
##
## Attaching package: 'spatialreg'
## The following objects are masked from 'package:spdep':
##
## as_dgRMatrix_listw, as_dsCMatrix_I, as_dsCMatrix_IrW,
## as_dsTMatrix_listw, as.spam.listw, can.be.simmed, cheb_setup,
## create_WX, do_ldet, eigen_pre_setup, eigen_setup, eigenw,
## errorsarlm, get.ClusterOption, get.coresOption, get.mcOption,
## get.VerboseOption, get.ZeroPolicyOption, GMargminImage, GMerrorsar,
## griffith_sone, gstsls, Hausman.test, impacts, intImpacts,
## Jacobian_W, jacobianSetup, l_max, lagmess, lagsarlm, lextrB,
## lextrS, lextrW, lmSLX, LU_prepermutate_setup, LU_setup,
## Matrix_J_setup, Matrix_setup, mcdet_setup, MCMCsamp, ME, mom_calc,
## mom_calc_int2, moments_setup, powerWeights, sacsarlm,
## SE_classic_setup, SE_interp_setup, SE_whichMin_setup,
## set.ClusterOption, set.coresOption, set.mcOption,
## set.VerboseOption, set.ZeroPolicyOption, similar.listw, spam_setup,
## spam_update_setup, SpatialFiltering, spautolm, spBreg_err,
## spBreg_lag, spBreg_sac, stsls, subgraph_eigenw, trW
lw <- nb2listw(pr_nb_k4s)
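For reference, a neighbour object of this kind can be built with spdep along these lines (a sketch, assuming properties_in_dd is an sf point layer; pr_nb_k4s itself was constructed earlier in the document):
# sketch: k=4 nearest neighbours, symmetrised so that i being a neighbour of j
# implies j is a neighbour of i
nb_k4_sym <- knn2nb(knearneigh(st_geometry(properties_in_dd), k=4), sym=TRUE)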
Fit a linear model to the lower-level data; all the included variables seem worth retaining:
LM_pr <- lm(f_pr, data=properties_in_dd)
summary(LM_pr)
##
## Call:
## lm(formula = f_pr, data = properties_in_dd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5691.2 -456.7 -202.2 251.9 6872.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1702.61693 78.59752 21.66 < 2e-16 ***
## size 5.18681 0.46510 11.15 < 2e-16 ***
## age -18.11938 1.33737 -13.55 < 2e-16 ***
## dist_metro -0.32446 0.07115 -4.56 5.75e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 792.2 on 996 degrees of freedom
## Multiple R-squared: 0.2368, Adjusted R-squared: 0.2345
## F-statistic: 103 on 3 and 996 DF, p-value: < 2.2e-16
However, there is strong residual autocorrelation:
lm.morantest(LM_pr, lw)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = f_pr, data = properties_in_dd)
## weights: lw
##
## Moran I statistic standard deviate = 26.039, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.4872794374 -0.0023026155 0.0003535235
Robust Lagrange multiplier tests suggest that the fitted model should include a spatial autoregressive process in the residuals, but not in the response:
lm.LMtests(LM_pr, lw, test=c("RLMerr", "RLMlag"))
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = f_pr, data = properties_in_dd)
## weights: lw
##
## RLMerr = 150.88, df = 1, p-value < 2.2e-16
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = f_pr, data = properties_in_dd)
## weights: lw
##
## RLMlag = 1.2688, df = 1, p-value = 0.26
Adding in the copied out municipality department level variables, we see that individually they do not seem worth retaining (unless there are good reasons for doing so); they do, however, improve the model fit:
LM_pr_md <- lm(f_pr_md, data=properties_in_dd)
summary(LM_pr_md)
##
## Call:
## lm(formula = f_pr_md, data = properties_in_dd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5175.9 -371.6 -107.1 266.7 6648.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1648.17176 189.84951 8.681 <2e-16 ***
## size 4.29551 0.43386 9.901 <2e-16 ***
## age -20.18585 1.27009 -15.893 <2e-16 ***
## dist_metro -0.15180 0.07159 -2.120 0.0342 *
## foreigners -38.13473 22.54311 -1.692 0.0910 .
## prgreensp 23.90080 16.33291 1.463 0.1437
## popdens -51.82590 103.65578 -0.500 0.6172
## museums -19.06125 21.27904 -0.896 0.3706
## airbnb 0.63284 0.32887 1.924 0.0546 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 727.6 on 991 degrees of freedom
## Multiple R-squared: 0.3596, Adjusted R-squared: 0.3544
## F-statistic: 69.55 on 8 and 991 DF, p-value: < 2.2e-16
The pre-test results are similar to those for the properties-only variables:
lm.morantest(LM_pr_md, lw)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = f_pr_md, data = properties_in_dd)
## weights: lw
##
## Moran I statistic standard deviate = 23.707, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.4307327715 -0.0070398073 0.0003410046
and the LM tests continue to indicate an omitted spatial process in the residual rather than the response:
lm.LMtests(LM_pr_md, lw, test=c("RLMerr", "RLMlag"))
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = f_pr_md, data = properties_in_dd)
## weights: lw
##
## RLMerr = 116.67, df = 1, p-value < 2.2e-16
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = f_pr_md, data = properties_in_dd)
## weights: lw
##
## RLMlag = 0.0035207, df = 1, p-value = 0.9527
We may update the formula for the properties-only model to include municipality department "fixed effects", that is, dummy variables for the departments:
LM_pr_fx <- lm(update(f_pr, . ~ . + num_dep), data=properties_in_dd)
summary(LM_pr_fx)
##
## Call:
## lm(formula = update(f_pr, . ~ . + num_dep), data = properties_in_dd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5175.3 -377.2 -97.9 272.7 6644.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.352e+03 9.816e+01 23.964 < 2e-16 ***
## size 4.295e+00 4.330e-01 9.919 < 2e-16 ***
## age -2.007e+01 1.269e+00 -15.819 < 2e-16 ***
## dist_metro -1.442e-01 7.154e-02 -2.015 0.044176 *
## num_dep2 -3.342e+02 8.695e+01 -3.844 0.000129 ***
## num_dep3 -4.896e+02 1.290e+02 -3.794 0.000157 ***
## num_dep4 -1.031e+03 1.193e+02 -8.641 < 2e-16 ***
## num_dep5 -8.222e+02 8.251e+01 -9.964 < 2e-16 ***
## num_dep6 -9.183e+02 7.861e+01 -11.681 < 2e-16 ***
## num_dep7 -6.527e+02 8.250e+01 -7.911 6.78e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 726.2 on 990 degrees of freedom
## Multiple R-squared: 0.3627, Adjusted R-squared: 0.3569
## F-statistic: 62.6 on 9 and 990 DF, p-value: < 2.2e-16
The pre-test output is similar to that for the models considered above:
lm.morantest(LM_pr_fx, lw)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = update(f_pr, . ~ . + num_dep), data =
## properties_in_dd)
## weights: lw
##
## Moran I statistic standard deviate = 23.611, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.4268054149 -0.0079765936 0.0003390813
lm.LMtests(LM_pr_fx, lw, test=c("RLMerr", "RLMlag"))
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = update(f_pr, . ~ . + num_dep), data =
## properties_in_dd)
## weights: lw
##
## RLMerr = 117.99, df = 1, p-value < 2.2e-16
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = update(f_pr, . ~ . + num_dep), data =
## properties_in_dd)
## weights: lw
##
## RLMlag = 0.030669, df = 1, p-value = 0.861
We may fit a regimes model, in which separate regression coefficients are estimated for the interactions between the municipality department dummies and the included variables; in the output below, size retains influence only for municipality departments 1, 2 and 7, and dist_metro only for departments 1 and 2.
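The nesting formula . ~ num_dep/(0 + .) gives each department its own intercept and its own slopes. A quick, illustrative way to see the expansion (an aside, not in the original) is to inspect the columns of the design matrix it generates:
# illustrative: the nesting operator expands to per-department intercepts and slopes
colnames(model.matrix(update(f_pr, . ~ num_dep/(0 + .)), data=properties_in_dd))
The regimes model is then fitted: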
LM_pr_reg <- lm(update(f_pr, . ~ num_dep/(0 + .)), data=properties_in_dd)
summary(LM_pr_reg)
##
## Call:
## lm(formula = update(f_pr, . ~ num_dep/(0 + .)), data = properties_in_dd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2055.0 -328.6 -62.8 257.2 6353.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## num_dep1 2366.55794 206.45952 11.463 < 2e-16 ***
## num_dep2 1114.08697 167.84304 6.638 5.29e-11 ***
## num_dep3 2095.07546 424.99044 4.930 9.68e-07 ***
## num_dep4 1351.29266 465.10378 2.905 0.003752 **
## num_dep5 1695.64352 234.29757 7.237 9.30e-13 ***
## num_dep6 1611.35873 167.16126 9.640 < 2e-16 ***
## num_dep7 1705.22449 176.18849 9.678 < 2e-16 ***
## num_dep1:size 1.36164 0.51169 2.661 0.007918 **
## num_dep2:size 14.62370 0.99848 14.646 < 2e-16 ***
## num_dep3:size 2.25680 4.50973 0.500 0.616886
## num_dep4:size 3.79360 4.56205 0.832 0.405864
## num_dep5:size 3.06546 1.93716 1.582 0.113872
## num_dep6:size 1.49313 1.33800 1.116 0.264724
## num_dep7:size 5.91935 1.37694 4.299 1.89e-05 ***
## num_dep1:age -6.83169 3.55084 -1.924 0.054651 .
## num_dep2:age -9.36978 3.07762 -3.044 0.002393 **
## num_dep3:age -18.07526 5.65924 -3.194 0.001449 **
## num_dep4:age -22.43736 5.03824 -4.453 9.43e-06 ***
## num_dep5:age -27.07143 2.71096 -9.986 < 2e-16 ***
## num_dep6:age -23.33611 2.34894 -9.935 < 2e-16 ***
## num_dep7:age -24.34731 2.65955 -9.155 < 2e-16 ***
## num_dep1:dist_metro -0.72509 0.21429 -3.384 0.000744 ***
## num_dep2:dist_metro -0.61219 0.17457 -3.507 0.000474 ***
## num_dep3:dist_metro -0.56261 0.52693 -1.068 0.285918
## num_dep4:dist_metro 0.02032 0.43294 0.047 0.962570
## num_dep5:dist_metro 0.10681 0.29026 0.368 0.712965
## num_dep6:dist_metro 0.05074 0.09829 0.516 0.605810
## num_dep7:dist_metro -0.14110 0.14727 -0.958 0.338274
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 663.8 on 972 degrees of freedom
## Multiple R-squared: 0.8323, Adjusted R-squared: 0.8274
## F-statistic: 172.2 on 28 and 972 DF, p-value: < 2.2e-16
The pre-test results now change, with possible spatial processes indicated in both the residuals and the response:
lm.morantest(LM_pr_reg, lw)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = update(f_pr, . ~ num_dep/(0 + .)), data =
## properties_in_dd)
## weights: lw
##
## Moran I statistic standard deviate = 18.521, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.3198250101 -0.0139390950 0.0003247394
lm.LMtests(LM_pr_reg, lw, test=c("RLMerr", "RLMlag"))
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = update(f_pr, . ~ num_dep/(0 + .)), data =
## properties_in_dd)
## weights: lw
##
## RLMerr = 40.873, df = 1, p-value = 1.625e-10
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = update(f_pr, . ~ num_dep/(0 + .)), data =
## properties_in_dd)
## weights: lw
##
## RLMlag = 19.055, df = 1, p-value = 1.27e-05
Fitting models initially by maximum likelihood (GMM may also be used), we pre-compute the eigenvalues:
eigs <- eigenw(lw)
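Pre-computing the eigenvalues makes the log-Jacobian term of the likelihood, \(\log|I - \lambda W| = \sum_i \log(1 - \lambda \varepsilon_i)\), cheap to evaluate for any candidate coefficient value; a quick illustration (assuming eigs is real, as it is here for row-standardised weights built from a symmetric neighbour object):
# log-determinant term at a trial value of lambda, no dense factorisation needed
lambda_trial <- 0.5
sum(log(1 - lambda_trial * eigs))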
The strong residual autocorrelation is picked up by the spatial coefficient, but unfortunately the Hausman test shows strong mis-specification:
SEM_pr <- errorsarlm(f_pr, data=properties_in_dd, listw=lw, Durbin=FALSE,
control=list(pre_eig=eigs))
summary(SEM_pr, Hausman=TRUE)
##
## Call:errorsarlm(formula = f_pr, data = properties_in_dd, listw = lw,
## Durbin = FALSE, control = list(pre_eig = eigs))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3214.224 -337.719 -75.561 196.920 5411.793
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1999.09195 120.70305 16.5621 < 2e-16
## size 3.21092 0.35840 8.9591 < 2e-16
## age -23.64759 1.06426 -22.2199 < 2e-16
## dist_metro -0.27302 0.14694 -1.8581 0.06315
##
## Lambda: 0.6927, LR test value: 459.76, p-value: < 2.22e-16
## Asymptotic standard error: 0.024179
## z-value: 28.649, p-value: < 2.22e-16
## Wald statistic: 820.74, p-value: < 2.22e-16
##
## Log likelihood: -7861.931 for error model
## ML residual variance (sigma squared): 354240, (sigma: 595.18)
## Number of observations: 1000
## Number of parameters estimated: 6
## AIC: 15736, (AIC for lm: 16194)
## Hausman test: -246.29, df: 4, p-value: < 2.22e-16
The Hausman test compares the OLS and SEM coefficient estimates and their standard errors, assessing whether their distributions overlap sufficiently to suggest the absence of major mis-specification:
(LM_coefs <- coef(summary(LM_pr)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1702.6169273 78.59752134 21.662476 1.432984e-85
## size 5.1868075 0.46509907 11.152049 2.679763e-27
## age -18.1193826 1.33737467 -13.548472 1.662323e-38
## dist_metro -0.3244594 0.07115232 -4.560068 5.749564e-06
(SEM_coefs <- coef(summary(SEM_pr)))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1999.0919466 120.7030525 16.562066 0.00000000
## size 3.2109208 0.3583989 8.959070 0.00000000
## age -23.6475917 1.0642550 -22.219854 0.00000000
## dist_metro -0.2730234 0.1469362 -1.858108 0.06315365
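An informal, per-coefficient version of this comparison (illustrative only; the Hausman statistic proper uses the two estimators' covariance matrices jointly) contrasts each pair of estimates against their combined standard errors:
# rough per-coefficient contrasts; large absolute values flag diverging estimates
d <- LM_coefs[, "Estimate"] - SEM_coefs[, "Estimate"]
round(d / sqrt(LM_coefs[, "Std. Error"]^2 + SEM_coefs[, "Std. Error"]^2), 2)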
The tables are harder to read than the figure, which shows that the coefficient estimates differ a lot for size and age, somewhat for the intercept, and little for dist_metro, although there the sampling distribution implied by the ML estimate and its standard error now overlaps zero:
opar <- par(mfrow=c(2,2))
# per-panel axis limits for intercept, size, age and dist_metro
xlims <- list(c(1400, 2500), c(1.5, 7), c(-28, -13), c(-0.9, 0.3))
ylims <- list(c(0, 0.006), c(0, 1.1), c(0, 0.4), c(0, 6))
for (i in seq_len(nrow(LM_coefs))) {
  plot(1, type="n", xlim=xlims[[i]], ylim=ylims[[i]], xlab=rownames(LM_coefs)[i], ylab="")
  # implied sampling distributions and point estimates: OLS in black, SEM in orange
  curve(dnorm(x, mean=LM_coefs[i,1], sd=LM_coefs[i,2]), add=TRUE)
  abline(v=LM_coefs[i,1])
  abline(v=SEM_coefs[i,1], lwd=2, col="orange")
  curve(dnorm(x, mean=SEM_coefs[i,1], sd=SEM_coefs[i,2]), add=TRUE, col="orange", lwd=2)
  if (i == 1) legend("topright", legend=c("LM", "SEM"), col=c("black", "orange"), lwd=1:2, bty="n")
}
par(opar)
The Hausman test also suggests mis-specification for the SEM model augmented with the municipality department level variables:
SEM_pr_md <- errorsarlm(f_pr_md, data=properties_in_dd, listw=lw, Durbin=FALSE,
control=list(pre_eig=eigs))
summary(SEM_pr_md, Hausman=TRUE)
##
## Call:errorsarlm(formula = f_pr_md, data = properties_in_dd, listw = lw,
## Durbin = FALSE, control = list(pre_eig = eigs))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3421.923 -320.793 -65.301 224.785 5452.101
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1945.79330 333.00265 5.8432 5.121e-09
## size 3.05839 0.35711 8.5643 < 2.2e-16
## age -23.45168 1.06301 -22.0616 < 2.2e-16
## dist_metro -0.14969 0.13214 -1.1328 0.2573
## foreigners -7.66310 42.72513 -0.1794 0.8577
## prgreensp 38.40538 31.36427 1.2245 0.2208
## popdens -211.52190 190.49687 -1.1104 0.2668
## museums -33.84713 39.33249 -0.8605 0.3895
## airbnb 0.59183 0.60062 0.9854 0.3244
##
## Lambda: 0.62414, LR test value: 335.94, p-value: < 2.22e-16
## Asymptotic standard error: 0.028289
## z-value: 22.063, p-value: < 2.22e-16
## Wald statistic: 486.78, p-value: < 2.22e-16
##
## Log likelihood: -7836.138 for error model
## ML residual variance (sigma squared): 345330, (sigma: 587.65)
## Number of observations: 1000
## Number of parameters estimated: 11
## AIC: 15694, (AIC for lm: 16028)
## Hausman test: -1348.4, df: 9, p-value: < 2.22e-16
Extending to the SDEM models, and reporting impacts:
SDEM_pr <- errorsarlm(f_pr, data=properties_in_dd, listw=lw, Durbin=TRUE,
control=list(pre_eig=eigs))
summary(impacts(SDEM_pr), short=TRUE, zstats=TRUE)
## Impact measures (SDEM, estimable, n):
## Direct Indirect Total
## size 3.4445402 2.0126058 5.4571460
## age -21.9222898 10.9931411 -10.9291487
## dist_metro 0.2000411 -0.5494199 -0.3493789
## ========================================================
## Standard errors:
## Direct Indirect Total
## size 0.3983375 1.0053504 1.2310345
## age 1.1425638 2.5191829 3.1162005
## dist_metro 0.2734082 0.3104821 0.1551602
## ========================================================
## Z-values:
## Direct Indirect Total
## size 8.647291 2.001895 4.432976
## age -19.186928 4.363772 -3.507203
## dist_metro 0.731657 -1.769570 -2.251730
##
## p-values:
## Direct Indirect Total
## size < 2e-16 0.045296 9.2941e-06
## age < 2e-16 1.2784e-05 0.00045284
## dist_metro 0.46438 0.076799 0.02433931
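In the SDEM case the impacts are directly estimable from the coefficients: the direct impacts are the coefficients on \(X\), the indirect impacts those on the lagged \(WX\) terms, and the totals their sums. A sketch for size (assuming the Durbin coefficients carry the lag. prefix, as in the SLX output further below):
# direct = coefficient on size, indirect = coefficient on lag.size, total = sum
b <- SDEM_pr$coefficients
c(direct=unname(b["size"]), indirect=unname(b["lag.size"]),
  total=unname(b["size"] + b["lag.size"]))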
The Hausman test results still indicate strong mis-specification:
Hausman.test(SDEM_pr)
##
## Spatial Hausman test (asymptotic)
##
## data: NULL
## Hausman test = 53.765, df = 7, p-value = 2.618e-09
The same applies to the properties variables augmented with the municipality department level variables:
SDEM_pr_md <- errorsarlm(f_pr_md, data=properties_in_dd, listw=lw, Durbin=TRUE,
control=list(pre_eig=eigs))
summary(impacts(SDEM_pr_md), short=TRUE, zstats=TRUE)
## Impact measures (SDEM, estimable, n):
## Direct Indirect Total
## size 3.2902831 1.6915614 4.9818445
## age -22.5877018 7.0789016 -15.5088002
## dist_metro 0.2706259 -0.4643729 -0.1937470
## foreigners 107.6194776 -146.8457500 -39.2262724
## prgreensp 35.3436543 -12.5353752 22.8082791
## popdens -456.7261794 400.0782288 -56.6479506
## museums 51.9819779 -93.3777533 -41.3957754
## airbnb -0.8658815 1.7162902 0.8504087
## ========================================================
## Standard errors:
## Direct Indirect Total
## size 0.3900357 0.9742177 1.1837952
## age 1.1250931 2.5043048 3.0554700
## dist_metro 0.2747535 0.3097415 0.1459340
## foreigners 87.2520039 94.9604414 45.6522660
## prgreensp 74.3147381 79.3194509 33.1282709
## popdens 403.5523890 439.7109664 210.7386098
## museums 85.7152273 93.9719231 43.7578240
## airbnb 1.1452242 1.2862740 0.6730806
## ========================================================
## Z-values:
## Direct Indirect Total
## size 8.4358510 1.7363279 4.2083670
## age -20.0762964 2.8266933 -5.0757495
## dist_metro 0.9849770 -1.4992275 -1.3276345
## foreigners 1.2334327 -1.5463887 -0.8592404
## prgreensp 0.4755941 -0.1580366 0.6884838
## popdens -1.1317643 0.9098664 -0.2688067
## museums 0.6064497 -0.9936772 -0.9460200
## airbnb -0.7560803 1.3343115 1.2634575
##
## p-values:
## Direct Indirect Total
## size < 2e-16 0.0825059 2.5722e-05
## age < 2e-16 0.0047031 3.8597e-07
## dist_metro 0.32464 0.1338146 0.18430
## foreigners 0.21741 0.1220107 0.39021
## prgreensp 0.63436 0.8744280 0.49115
## popdens 0.25773 0.3628930 0.78808
## museums 0.54422 0.3203801 0.34414
## airbnb 0.44960 0.1821018 0.20642
Hausman.test(SDEM_pr_md)
##
## Spatial Hausman test (asymptotic)
##
## data: NULL
## Hausman test = 229.24, df = 17, p-value < 2.2e-16
Turning to the SLX models does not help. As with the SDEM models, the indirect impacts (the coefficients on the lagged \(X\) variables) are large, so including lagged \(X\) variables, especially at the properties level, seems sensible; but there is serious residual autocorrelation, and the pre-test strategy now points to a missing spatial process in the response:
SLX_pr <- lmSLX(f_pr, data=properties_in_dd, listw=lw, Durbin=TRUE)
summary(impacts(SLX_pr), short=TRUE, zstats=TRUE)
## Impact measures (SlX, estimable, n-k):
## Direct Indirect Total
## size 3.9424014 6.4922295 10.434631
## age -22.6778266 13.6377582 -9.040068
## dist_metro 0.5710406 -0.8727816 -0.301741
## ========================================================
## Standard errors:
## Direct Indirect Total
## size 0.4600099 0.8896271 0.90358915
## age 1.3642232 2.1630220 2.15389898
## dist_metro 0.3650804 0.3766962 0.07031201
## ========================================================
## Z-values:
## Direct Indirect Total
## size 8.570254 7.297698 11.547982
## age -16.623252 6.304956 -4.197072
## dist_metro 1.564150 -2.316938 -4.291458
##
## p-values:
## Direct Indirect Total
## size < 2e-16 2.9265e-13 < 2.22e-16
## age < 2e-16 2.8828e-10 2.7039e-05
## dist_metro 0.11778 0.020507 1.7750e-05
lm.morantest(SLX_pr, lw)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1],
## collapse = "+"))), data = as.data.frame(x), weights = weights)
## weights: lw
##
## Moran I statistic standard deviate = 24.275, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.4505572652 -0.0029076149 0.0003489564
lm.LMtests(SLX_pr, lw, test=c("RLMerr", "RLMlag"))
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1],
## collapse = "+"))), data = as.data.frame(x), weights = weights)
## weights: lw
##
## RLMerr = 3.9929, df = 1, p-value = 0.04569
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1],
## collapse = "+"))), data = as.data.frame(x), weights = weights)
## weights: lw
##
## RLMlag = 54.185, df = 1, p-value = 1.825e-13
SLX_pr_md <- lmSLX(f_pr_md, data=properties_in_dd, listw=lw, Durbin=TRUE)
summary(impacts(SLX_pr_md), short=TRUE, zstats=TRUE)
## Impact measures (SlX, estimable, n-k):
## Direct Indirect Total
## size 3.6899279 5.1853598 8.8752878
## age -22.4862582 7.7433691 -14.7428891
## dist_metro 0.4883417 -0.6216376 -0.1332958
## foreigners 111.8693211 -144.2912677 -32.4219466
## prgreensp -7.9374540 33.7296361 25.7921821
## popdens -175.9679694 76.3367323 -99.6312371
## museums 132.9274926 -150.5428303 -17.6153377
## airbnb -1.6203107 2.0992718 0.4789610
## ========================================================
## Standard errors:
## Direct Indirect Total
## size 0.4374870 0.8640075 0.88540694
## age 1.2999436 2.1706052 2.20258646
## dist_metro 0.3469593 0.3606514 0.07353124
## foreigners 113.3256387 115.6423334 22.69480121
## prgreensp 98.1127344 99.5966256 16.36663957
## popdens 521.4661878 532.2960996 105.63203070
## museums 115.4990904 118.1243717 21.60621836
## airbnb 1.5181784 1.5638933 0.33577968
## ========================================================
## Z-values:
## Direct Indirect Total
## size 8.43437158 6.0015219 10.0239645
## age -17.29787276 3.5673779 -6.6934440
## dist_metro 1.40748996 -1.7236520 -1.8127784
## foreigners 0.98714927 -1.2477374 -1.4286068
## prgreensp -0.08090136 0.3386624 1.5758997
## popdens -0.33744847 0.1434103 -0.9431915
## museums 1.15089645 -1.2744434 -0.8152902
## airbnb -1.06727292 1.3423369 1.4264145
##
## p-values:
## Direct Indirect Total
## size < 2e-16 1.9548e-09 < 2.22e-16
## age < 2e-16 0.00036057 2.1798e-11
## dist_metro 0.15928 0.08477068 0.069866
## foreigners 0.32357 0.21212723 0.153117
## prgreensp 0.93552 0.73486404 0.115049
## popdens 0.73578 0.88596617 0.345583
## museums 0.24977 0.20250631 0.414906
## airbnb 0.28585 0.17948677 0.153749
lm.morantest(SLX_pr_md, lw)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1],
## collapse = "+"))), data = as.data.frame(x), weights = weights)
## weights: lw
##
## Moran I statistic standard deviate = 22.047, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.3978202243 -0.0070160067 0.0003371825
lm.LMtests(SLX_pr_md, lw, test=c("RLMerr", "RLMlag"))
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1],
## collapse = "+"))), data = as.data.frame(x), weights = weights)
## weights: lw
##
## RLMerr = 15.011, df = 1, p-value = 0.0001069
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1],
## collapse = "+"))), data = as.data.frame(x), weights = weights)
## weights: lw
##
## RLMlag = 56.758, df = 1, p-value = 4.929e-14
So on balance, the pre-test strategy has not worked out too well; it is unclear what is missing in the model.
Turning to estimating the general nested model first, followed by excluding the Durbin (spatially lagged \(X\)) variables, a likelihood ratio test shows that the spatially lagged \(X\) variables should be retained in the model:
GNM_pr <- sacsarlm(f_pr, data=properties_in_dd, listw=lw, Durbin=TRUE,
control=list(pre_eig1=eigs, pre_eig2=eigs))
SARAR_pr <- sacsarlm(f_pr, data=properties_in_dd, listw=lw,
control=list(pre_eig1=eigs, pre_eig2=eigs))
lmtest::lrtest(SARAR_pr, GNM_pr)
## Likelihood ratio test
##
## Model 1: prpsqm ~ size + age + dist_metro
## Model 2: prpsqm ~ size + age + dist_metro
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 7 -7814.6
## 2 10 -7783.9 3 61.303 3.096e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again using a likelihood ratio test, the GNM model outperforms the SDEM model:
lmtest::lrtest(SDEM_pr, GNM_pr)
## Likelihood ratio test
##
## Model 1: prpsqm ~ size + age + dist_metro
## Model 2: prpsqm ~ size + age + dist_metro
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 9 -7850.3
## 2 10 -7783.9 1 132.7 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
SDM_pr <- lagsarlm(f_pr, data=properties_in_dd, listw=lw, Durbin=TRUE,
control=list(pre_eig=eigs))
as is also the case with the SDM model:
lmtest::lrtest(SDM_pr, GNM_pr)
## Likelihood ratio test
##
## Model 1: prpsqm ~ size + age + dist_metro
## Model 2: prpsqm ~ size + age + dist_metro
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 9 -7842.4
## 2 10 -7783.9 1 117.06 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
and the SLX model:
lmtest::lrtest(SLX_pr, GNM_pr)
## Warning in modelUpdate(objects[[i - 1]], objects[[i]]): original model was of
## class "SlX", updated model is of class "Sarlm"
## Likelihood ratio test
##
## Model 1: y ~ size + age + dist_metro + lag.size + lag.age + lag.dist_metro
## Model 2: prpsqm ~ size + age + dist_metro
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 8 -8040.1
## 2 10 -7783.9 2 512.46 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Is the inclusion of the municipality department level variables in the GNM model justified?
GNM_pr_md <- sacsarlm(f_pr_md, data=properties_in_dd, listw=lw, Durbin=TRUE,
control=list(pre_eig1=eigs, pre_eig2=eigs))
No, not really:
lmtest::lrtest(GNM_pr, GNM_pr_md)
## Likelihood ratio test
##
## Model 1: prpsqm ~ size + age + dist_metro
## Model 2: prpsqm ~ size + age + dist_metro + foreigners + prgreensp + popdens +
## museums + airbnb
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 10 -7783.9
## 2 20 -7773.4 10 20.956 0.0214 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If we drop the municipality department level variables from the Durbin term, we lose fewer degrees of freedom, and the likelihood ratio test now prefers the model including the municipality department level variables:
GNM_pr_md1 <- sacsarlm(f_pr_md, data=properties_in_dd, listw=lw,
Durbin= ~ size + age + dist_metro,
control=list(pre_eig1=eigs, pre_eig2=eigs))
lmtest::lrtest(GNM_pr, GNM_pr_md1)
## Likelihood ratio test
##
## Model 1: prpsqm ~ size + age + dist_metro
## Model 2: prpsqm ~ size + age + dist_metro + foreigners + prgreensp + popdens +
## museums + airbnb
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 10 -7783.9
## 2 15 -7775.5 5 16.794 0.004908 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Unfortunately, the impacts here are disappointing. In the code below, trW() pre-computes traces of powers of the weights matrix for the impact calculations, and R=2000 simulation draws are used for the impact standard errors:
trs <- trW(as(lw, "CsparseMatrix"))
i_GNM_pr_md1 <- impacts(GNM_pr_md1, tr=trs, R=2000)
summary(i_GNM_pr_md1, short=TRUE, zstats=TRUE)
## Impact measures (sacmixed, trace):
## Direct Indirect Total
## size 3.23740995 7.2985861 10.5359961
## age -23.03565197 10.9935858 -12.0420662
## dist_metro 0.26484655 -0.4603147 -0.1954681
## foreigners -5.71030959 -23.1783890 -28.8886986
## prgreensp 5.63045876 22.8542711 28.4847299
## popdens -20.87711878 -84.7411114 -105.6182302
## museums -4.03610471 -16.3827204 -20.4188251
## airbnb 0.09458806 0.3839369 0.4785250
## ========================================================
## Simulation results ( variance matrix):
## ========================================================
## Simulated standard errors
## Direct Indirect Total
## size 0.3675564 2.7172986 2.8701576
## age 1.0662779 6.2386275 6.6524247
## dist_metro 0.2172000 0.3157051 0.2011017
## foreigners 11.5239909 47.9039965 59.3689907
## prgreensp 8.4447492 34.9557814 43.3583761
## popdens 53.4568884 221.8429177 275.0663816
## museums 11.2137949 46.3673128 57.5286869
## airbnb 0.1731131 0.7184815 0.8906739
##
## Simulated z-values:
## Direct Indirect Total
## size 8.8005035 2.6735477 3.6581644
## age -21.5633924 1.7943591 -1.7735235
## dist_metro 1.2274254 -1.4816625 -1.0003479
## foreigners -0.4812273 -0.4726568 -0.4747901
## prgreensp 0.6779955 0.6605050 0.6645537
## popdens -0.3963452 -0.3823738 -0.3854135
## museums -0.3397487 -0.3362316 -0.3372235
## airbnb 0.5265902 0.5183114 0.5204562
##
## Simulated p-values:
## Direct Indirect Total
## size < 2e-16 0.0075054 0.00025403
## age < 2e-16 0.0727559 0.07614201
## dist_metro 0.21966 0.1384301 0.31714218
## foreigners 0.63035 0.6364580 0.63493658
## prgreensp 0.49777 0.5089298 0.50633600
## popdens 0.69185 0.7021841 0.69993106
## museums 0.73405 0.7366962 0.73594840
## airbnb 0.59848 0.6042411 0.60274564
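For intuition about the machinery (an aside, not in the original): the traces returned by trW() feed a truncated power-series approximation \((I - \rho W)^{-1} \approx \sum_{i=0}^{m} \rho^i W^i\), from which, for example, the mean diagonal element of the multiplier matrix can be approximated:
# approximate mean diagonal of (I - rho W)^{-1} from the traces tr(W^i)
rho_trial <- 0.5
m <- length(trs) # trW() returns tr(W^i) for i = 1..m, with m = 30 by default
1 + sum(rho_trial^seq_len(m) * trs) / nrow(properties_in_dd)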
The values and standard errors of the spatial coefficients suggest numerical problems in finding the optimum: the two coefficients are almost equally strong but of opposite sign:
c("response"=GNM_pr_md1$rho, "response se"=GNM_pr_md1$rho.se, "residual"=GNM_pr_md1$lambda, "residual se"=GNM_pr_md1$lambda.se)
## response.rho response se residual.lambda residual se
## 0.85971901 0.01547632 -0.86679315 0.04377824
If we fall back on the properties level only GNM, total impacts are only significant in conventional terms for size:
i_GNM_pr <- impacts(GNM_pr, tr=trs, R=2000)
summary(i_GNM_pr, short=TRUE, zstats=TRUE)
## Impact measures (sacmixed, trace):
## Direct Indirect Total
## size 3.3625439 9.2475200 12.610064
## age -22.7517496 16.8353068 -5.916443
## dist_metro 0.2927224 -0.6523253 -0.359603
## ========================================================
## Simulation results ( variance matrix):
## ========================================================
## Simulated standard errors
## Direct Indirect Total
## size 0.3745881 2.8837104 3.0388158
## age 1.0767407 6.5778940 7.0048158
## dist_metro 0.2211039 0.3082192 0.1990367
##
## Simulated z-values:
## Direct Indirect Total
## size 8.988053 3.189002 4.1341687
## age -21.094711 2.605198 -0.7961408
## dist_metro 1.321073 -2.121656 -1.8179601
##
## Simulated p-values:
## Direct Indirect Total
## size < 2e-16 0.0014276 3.5624e-05
## age < 2e-16 0.0091821 0.42595
## dist_metro 0.18648 0.0338666 0.06907
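As a cross-check on the reported totals (a sketch; it again assumes the lag. prefix on the Durbin coefficients), the average total impact of a variable in a model with a spatially lagged response is exactly \((\beta_k + \gamma_k)/(1 - \rho)\), independent of the error coefficient \(\lambda\):
# should reproduce the total impact for size reported above
b <- GNM_pr$coefficients
unname((b["size"] + b["lag.size"]) / (1 - GNM_pr$rho))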
The same problem occurs without the municipality department level variables; the impacts are being driven by the large spatial coefficient on the lagged response:
c("response"=GNM_pr$rho, "response se"=GNM_pr$rho.se, "residual"=GNM_pr$lambda, "residual se"=GNM_pr$lambda.se)
## response.rho response se residual.lambda residual se
## 0.88013826 0.01319649 -0.89698374 0.03914228
We cannot say that the spatial econometrics approach has reached a clear conclusion. Including the upper level variables introduces a lot of spatial autocorrelation at the lower level. It is arguable that the MRF random effect at the upper level, combined with only the properties level variables, gets at least as far as the most complex spatial econometrics models. It is fairly clear that mapping the actual green spaces and museums, and measuring the distance from each property to these attractions, would remove the scale problem for those variables. Disaggregation of the foreigners, airbnb and population density variables would be highly desirable. With improvements to the properties level data set, including more variables describing the properties themselves, much of the mis-specification should be removed.